
April 28, 2025

o3 vs o1-pro: What We Learned After 1,000+ Math Prompts

We evaluate the reasoning abilities of OpenAI's newest model: o3

If you have ever asked ChatGPT to crunch a proof, you know the frustration of watching a language model do mathematics. Over the past few weeks we ran two OpenAI models, o1-pro and the newest o3, through hundreds of combinatorics, calculus, and graph-theory challenges. Here’s a quick tour of where the upgrade really matters, where it merely feels better, and why you still need to double-check its outputs carefully.

1. Where o3 really shines

Long-form focus. o1-pro can keep about a page of mathematics in its head before earlier lemmas start slipping through the cracks. o3 comfortably remembers two or three pages, so you can develop a multi-step proof without constantly pasting definitions back in. 

Numerical sanity checks. Ask o3 to solve an equation or prove a new lemma and there’s a good chance it will pause, spin up a quick Python snippet, and test the claim on a toy example before proceeding. That habit catches a lot of small mishaps automatically. o1-pro will run code too, but only if you nudge it.
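For illustration, here is the kind of quick numerical sanity check we mean. This is our own toy sketch, not lifted from any model transcript: before leaning on two-variable AM-GM in a proof, test it on random inputs.

```python
import random

# Claimed lemma (two-variable AM-GM): (a + b) / 2 >= sqrt(a * b)
# for all non-negative a, b.
def lemma_holds(a, b):
    return (a + b) / 2 >= (a * b) ** 0.5

# Spot-check the claim on random non-negative inputs before relying on it.
for _ in range(10_000):
    a, b = random.uniform(0, 100), random.uniform(0, 100)
    assert lemma_holds(a, b), f"counterexample: a={a}, b={b}"

print("no counterexample in 10,000 random trials")
```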

Creative ideas. When o1-pro hits a roadblock it usually offers one standard technique and waits for directions. o3 tends to brainstorm: “We could try an eigenvalue interlacing approach, or a flow-cut argument, or maybe a probabilistic deletion trick.” Even if you ignore half the suggestions, the creative breadth is impressive.

Structured proof process. Instead of dumping a wall of LaTeX, o3 often opens with a roadmap, proves Lemma 1, stops to critique its own bound, and only then pushes on to the next step. The structure feels a lot like working with a diligent grad student who narrates what they’re doing as they go.

2. Shortcomings to watch for

Silent sign flips. o3’s most common blunder looks innocent: somewhere in the algebra a \( \geq \) becomes a \( \leq \), or an absolute value is dropped. Because the model carries a longer train of thought, that little slip can invalidate an entire page before the contradiction finally surfaces. Numeric spot-checks still struggle to catch these slips; you have to double-check any algebra yourself.
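To see why spot-checks miss these, here is a hypothetical illustration (again our own toy example, not a model transcript): a faulty squaring step survives thousands of random checks simply because the samples never go negative.

```python
import random

# Faulty step: from a >= b, conclude a**2 >= b**2.
# This is only valid when b >= 0 -- exactly the kind of silent
# sign issue a numeric spot-check can miss.
def faulty_step_holds(a, b):
    return a ** 2 >= b ** 2

# If the toy examples only sample non-negative values (a common default),
# the bad step passes every check:
for _ in range(10_000):
    b = random.uniform(0, 100)
    a = b + random.uniform(0, 100)  # enforce a >= b
    assert faulty_step_holds(a, b)

# Yet a single signed counterexample invalidates the step:
a, b = 1, -2                        # a >= b, but 1 = a**2 < b**2 = 4
print(faulty_step_holds(a, b))      # False
```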

One caveat: most of our hardest problems were thrown at o3 only after it launched. We never gave o1-pro tests of this difficulty, so the raw gap may be smaller than our anecdotes suggest. Keep that in mind if you see screenshots comparing the two.

3. Practical tips for power users

Audit the algebra. After any multi-line inequality manipulation, ask: “Restate every inequality direction you used.” It forces the model (and you) to sanity-check signs.

Checkpoint lemmas. Get o3 to freeze intermediate results in bold boxes before moving on. If the next step collapses, you know where to rewind. We will discuss this technique in detail in a forthcoming blog post.

Let it code, but verify. o3’s Python snippets are usually correct, but run them yourself and glance at the output.

Use o1-pro for quick-and-dirty work, o3 for deep dives. The lighter model is snappier for homework-level algebra; the heavier one pays off when you need genuine research scaffolding.

4. The bottom line

o3 represents a real advance in the mathematical workflow: a longer attention span, built-in numerical experiments, and richer brainstorming. Its core symbolic manipulation, however, is still error-prone. Treat every ChatGPT proof, especially the elegant ones, as a draft. Your job as a human is to supply a level of rigor that even the smartest language model hasn’t mastered (yet).
