Lean Just Found a 20-Year-Old Error in a Major Physics Paper. Three of Our Reviews Flagged the Same Neighborhood.

2026-05-04 - Adam Murphy

Lean caught a 20-year-old error in a widely cited physics paper. We ran the same paper through TheoryOfEverything.ai three times; every run flagged the neighborhood of equations where the flaw lives.

In March 2026, Joseph Tooby-Smith at the University of Bath did something unusual. He took a widely-cited 2006 physics paper — Maniatis, von Manteuffel, Nachtmann, and Nagel's "Stability and Symmetry Breaking in the General Two-Higgs-Doublet Model" (Eur. Phys. J. C 48, 805–823) — and started translating it into Lean, an interactive theorem prover normally used by mathematicians to formally verify proofs.

This was supposed to be a tick-box exercise. The paper had stood for two decades, served as a standard reference for two-Higgs-doublet model (THDM) stability analysis, and survived peer review at a respected journal. Tooby-Smith was just adding it to PhysLib, a growing project that aims to formalize known physics results into machine-checkable proofs.

Lean refused to compile it.

Theorem 1 — the central stability result of the paper — could not be derived from the arguments given. The flaw was real, the authors acknowledged it, and an erratum is now in progress. The full write-up is on arXiv as 2603.08139, and it's worth reading in full.

This is, as far as we know, the first non-trivial error in a physics paper found through formalization. It will not be the last.


Why this matters

Tooby-Smith's paper contains a sentence that should make every working physicist pause:

"It was one of the first papers where formalization was attempted, which raises the uncomfortable question of how many physics papers would not pass this higher level of scrutiny."

The cultural difference between mathematics and physics matters here. Mathematicians treat proof steps as load-bearing — every line has to be justified. Physicists, working in a results-oriented field with messy real-world phenomena to chase, often compress proofs and rely on physical intuition to bridge gaps. Most of the time, this works fine. Sometimes it doesn't, and the gap goes unnoticed for twenty years because no one ever forces the proof through a system that demands every step.

Lean is one such system. It does not care about intuition. It does not care about reputation. It does not care that the paper has been cited hundreds of times. If a step is missing, it does not compile.
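
To make that concrete, here is a toy Lean 4 sketch. It is our own illustration, not an excerpt from PhysLib or the THDM formalization: a claim whose proof is fully supplied compiles, while a proof that leaves a gap fails with an "unsolved goals" error instead of being waved through.

    -- Toy illustration (not from PhysLib): Lean accepts a claim only when
    -- every step of its proof is actually supplied.

    -- This compiles, because the step is justified by an existing lemma:
    theorem add_comm_toy (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- This does not: the tactic block leaves the goal open, so Lean reports
    -- "unsolved goals" and the file fails to build, however plausible the
    -- statement looks on paper.
    -- theorem add_comm_gap (a b : Nat) : a + b = b + a := by
    --   skip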

The catch, which Tooby-Smith also notes, is that formalization is expensive. Automating it would require a corpus of formalized physics to train on, and that corpus is small. He estimates "a million lines of physics" would be a reasonable target, and that's a long way off.

So formalization is one tool. It's slow, rigorous, and binary. The question is what other tools belong alongside it.


We ran the same paper through TheoryOfEverything.ai

We took the original 2006 paper and ran it through our multi-agent review pipeline — ten specialists across seven models, three providers, with a coordinator synthesizing the result. We did this blind. No hints about Tooby-Smith's finding, no foundational assumptions pointing the agents at Theorem 1, no special prompting.
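
As a rough sketch of what that coordination step looks like, here is a minimal Python version. The real pipeline is not public, so every name below (SpecialistReview, synthesize, the recurrence threshold) is hypothetical; the point is only the pattern: independent specialist reviews go in, and the coordinator averages scores and surfaces the flags that recur across them.

    # Hypothetical sketch only; the real TheoryOfEverything.ai pipeline is
    # not public, and all names and thresholds here are our own.
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class SpecialistReview:
        specialist: str    # e.g. "Math/Logic Specialist"
        model: str         # which underlying model produced the review
        score: float       # overall rating on a 1-5 scale
        flags: list[str]   # passages or equations the specialist found weak

    def synthesize(reviews: list[SpecialistReview]) -> dict:
        """Coordinator step: average the scores and keep only the flags
        raised by at least two independent reviews (the consensus signal)."""
        counts: dict[str, int] = {}
        for review in reviews:
            for flag in review.flags:
                counts[flag] = counts.get(flag, 0) + 1
        recurring = [flag for flag, n in counts.items() if n >= 2]
        return {
            "average_score": round(mean(r.score for r in reviews), 1),
            "recurring_flags": recurring,
        }

    # Example: two specialists flag the same equations, a third flags something else.
    reviews = [
        SpecialistReview("Math/Logic", "model-a", 4.0, ["eqs 4.39-4.42 compressed"]),
        SpecialistReview("Math/Logic", "model-b", 4.0, ["eqs 4.39-4.42 compressed"]),
        SpecialistReview("Novelty", "model-c", 3.8, ["overlap with prior work"]),
    ]
    print(synthesize(reviews))
    # {'average_score': 3.9, 'recurring_flags': ['eqs 4.39-4.42 compressed']}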

Then we did it twice more, including once after a general improvement to our system prompts that made our math specialists more direct about flagging compressed derivations.

All three runs approved the paper. Average scores: 4.0, 3.8, and 4.0 out of 5. Internal consistency scores stayed high, and mathematical validity scored 4/5 every time.

But here's what's interesting. The Math/Logic Specialist flagged the same region of the paper in every single run:

  • Run 1: "Some key derivational steps are compressed, particularly for exceptional solutions and degenerate cases."
  • Run 2: "Some key derivations are compressed, particularly the exceptional solution analysis and minimization formulas like equations (4.39)-(4.40)."
  • Run 3: "Several key derivations are compressed or omitted, particularly the exceptional-case conditions (equations 4.39-4.42) and mass matrix expressions (7.19-7.20)."

Equations 4.39 through 4.42 are the boundary and exceptional-case stationary-point conditions inside Theorem 1. That is exactly the part of the paper Lean broke on. Three independent runs, under more than one prompt configuration, and the math specialist landed on the same neighborhood every time, without ever being told where to look.

It just stopped one step short of saying "and therefore Theorem 1 is invalid."


What this tells us about AI peer review

This is, to be honest, the result we expected. Multi-agent prose review and formal verification are different tools, and they hit different ceilings.

What multi-agent review does well: It reads the paper the way an expert reviewer would — fast, broadly, with attention to coherence, novelty, and where the argument feels thin. Across multiple models from multiple providers, the consensus signal is real. When several independent agents flag the same equations as "compressed" or "plausible but not fully derived," that is a meaningful signal worth taking seriously. It is, in essence, peer review at scale.

What multi-agent review cannot do: Force the proof through. The agents read equations 4.39 through 4.42 and saw derivations that looked plausible. A reasonable graduate student would have read them the same way. The error only becomes undeniable when something (Lean, in this case) refuses to fill in the missing step on the authors' behalf.

This is not a failure of AI review. It is an inherent limitation of prose review. Peer reviewers missed this for twenty years for the same reason. The flaw lived in compressed steps that read as credible, and an error hidden in credible-looking compressed steps is exactly what neither human reviewers nor language models are equipped to catch.

What AI review does do — and the Tooby-Smith result confirms this — is point you at the right neighborhood. Three runs, three flags, all in the right region of the paper. That is genuinely useful. It tells a researcher where the mathematical risk is concentrated, even when it cannot certify whether the risk is realized.


Where this goes

Formal verification and AI prose review are complementary, not competing. The Tooby-Smith result is a milestone for the formalization community; we suspect it's also a preview of where independent research review needs to go.

For now, multi-agent platforms like TheoryOfEverything.ai work as a fast, accessible first pass — flagging where proofs are compressed, where novelty claims overreach, where assumptions need to be made explicit. Formal verification provides the harder backstop on the math itself, when it can be applied. Both have value. Neither is a replacement for the other.

We'll keep tightening our calibration. The fact that the math specialist localized the issue without being told is the kind of signal we want to push harder on, and the kind of result we'll keep publishing whether it flatters us or not.

You can read the original 2006 paper here, Tooby-Smith's formalization paper here, and our review profile for the 2006 paper here on TheoryOfEverything.ai.


TheoryOfEverything.ai is a multi-agent peer review platform for independent theoretical physics research. We run submissions across seven models from three providers and synthesize their feedback into a structured review. We publish our calibration results — including the ones that don't go our way.