We evaluate the reasoning abilities of OpenAI's newest model: o3
If you have ever asked ChatGPT to crunch a proof, you know the frustration of watching a language model do mathematics. Over the past few weeks we ran two OpenAI models, o1-pro and the newest o3, through hundreds of combinatorics, calculus, and graph-theory challenges. Here’s a quick tour of where the upgrade really matters, where it merely feels better, and why you still need to double-check its outputs carefully.
Long-form focus. o1-pro can keep about a page of mathematics in its head before earlier lemmas start slipping through the cracks. o3 comfortably remembers two or three pages, so you can develop a multi-step proof without constantly pasting definitions back in.
Numerical sanity checks. Ask o3 to solve an equation or prove a new lemma and there’s a good chance it will pause, spin up a quick Python snippet, and test the claim on a toy example before proceeding. That habit catches a lot of small mishaps automatically. o1-pro will run code too, but only if you nudge it.
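The kind of quick check o3 writes for itself looks roughly like the sketch below. This is our own hypothetical reconstruction, not an actual o3 transcript; the claim being tested here is the Cauchy–Schwarz inequality on random toy vectors.

```python
import random

def check_cauchy_schwarz(trials=1000, dim=5):
    """Test the claim (u.v)^2 <= |u|^2 |v|^2 on random toy vectors."""
    for _ in range(trials):
        u = [random.uniform(-10, 10) for _ in range(dim)]
        v = [random.uniform(-10, 10) for _ in range(dim)]
        lhs = sum(x * y for x, y in zip(u, v)) ** 2
        rhs = sum(x * x for x in u) * sum(y * y for y in v)
        # small relative tolerance so floating-point noise can't flip the verdict
        if lhs > rhs * (1 + 1e-12):
            return False  # found a counterexample: the claim is wrong
    return True  # no counterexample on the sampled points

print(check_cauchy_schwarz())
```

A passing run is evidence, not a proof, but it reliably weeds out claims that are wrong on their face before the model sinks a page of algebra into them.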
Creative ideas. When o1-pro hits a roadblock it usually offers one standard technique and waits for directions. o3 tends to brainstorm: “We could try an eigenvalue interlacing approach, or a flow-cut argument, or maybe a probabilistic deletion trick.” Even if you ignore half the suggestions, the creative breadth is impressive.
Structured proof process. Instead of dumping a wall of LaTeX, o3 often opens with a roadmap, proves Lemma 1, stops to critique its own bound, and only then pushes on to the next step. The structure feels a lot like working with a diligent grad student who narrates what they’re doing as they go.
Silent sign flips. o3’s most common blunder looks innocent: somewhere in the algebra a \( \geq \) becomes \( \leq \) or an absolute value is dropped. Because the model carries a longer train of thought, that little slip can invalidate an entire page before the contradiction finally surfaces. Numeric spot-checks still struggle to catch these, so you have to double-check the algebra by hand.
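To see why spot-checks give false confidence, consider a toy illustration (our own construction, not anything o3 produced): a false claim such as “\( x^2 \geq x \) for all \( x > 0 \)” survives every numeric test if the toy examples happen to sample only \( x \geq 1 \).

```python
import random

def spot_check(lo, hi, trials=1000):
    """Spot-check the (false) claim x**2 >= x on random samples in [lo, hi]."""
    return all(x * x >= x for x in (random.uniform(lo, hi) for _ in range(trials)))

print(spot_check(1, 10))     # True: the narrow test range hides the flaw
print(spot_check(0.01, 10))  # almost surely False: samples in (0, 1) expose it
```

The slip is invisible or obvious depending entirely on where the toy examples land, which is exactly the failure mode a sign flip exploits.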
However, remember that most of our hardest problems were thrown at o3 after it launched. We never gave o1-pro tests of this difficulty, so the raw gap may be smaller than our anecdotal stories suggest. Keep that in mind if you see screenshots comparing the two.
Audit the algebra. After any multi-line inequality manipulation, ask: “Restate every inequality direction you used.” It forces the model (and you) to sanity-check signs.
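You can mechanize part of this audit yourself: test every link of the inequality chain separately, not just the end-to-end claim, since the outer claim can hold even when a middle step is flipped. A sketch with a made-up chain (in practice you would paste in the model’s actual expressions):

```python
import random

# Claimed chain for x >= 1:  x <= x**2 <= x**3
chain = [lambda x: x, lambda x: x ** 2, lambda x: x ** 3]

def audit_chain(chain, lo=1.0, hi=100.0, trials=1000):
    """Check every adjacent pair in the chain, not only the outermost claim."""
    for _ in range(trials):
        x = random.uniform(lo, hi)
        for i, (f, g) in enumerate(zip(chain, chain[1:])):
            if f(x) > g(x) + 1e-12:  # tiny tolerance for float noise
                return f"step {i} -> {i + 1} fails at x = {x:.3f}"
    return "all steps hold on sampled points"

print(audit_chain(chain))
```

If one step fails, the report tells you exactly which link to rewind to, which pairs nicely with the checkpointing tip below.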
Checkpoint lemmas. Get o3 to freeze intermediate results in bold boxes before moving on. If the next step collapses, you know where to rewind. We will discuss this technique in detail in a forthcoming blog post.
Let it code, but verify. o3’s Python snippets are usually correct, but run them yourself and glance at the output.
Use o1-pro for quick-and-dirty, o3 for deep dives. The lighter model is snappier for homework-level algebra; the heavier one pays off when you need genuine research scaffolding.
o3 represents a real advance in mathematical workflow: longer attention span, built-in numerical experiments, and richer brainstorming. However, the core symbolic manipulation capabilities are still error-prone. Treat every ChatGPT proof, especially the elegant ones, as a draft. Your job as a human is to supply a level of rigor that even the smartest language model hasn’t mastered (yet).