Five AI Models Wrote the Same Paper. Here's What Happened.

2026-03-20 · Adam Murphy

Same quantum-centric supercomputing prompt, five frontier models, one blind multi-agent review panel — a 1.9-point score spread, a dispute that held, and QCAB going from 3.3 to published in four revision cycles.

TOE-Share calibration report

We gave five AI models the same assignment: write a research paper on quantum-centric supercomputing. Same topic. Same prompt. No special instructions. Then we submitted each paper — blind — through TOE-Share's multi-agent review system.

The results tell you everything about how AI does science, where it fails, and why structured review matters.


The Experiment

The goal was calibration. We needed to know whether TOE-Share's review system could discriminate between papers of different quality — not just catch obvious garbage, but distinguish between subtle differences in rigor, honesty, and intellectual ambition.

So we ran the simplest possible test: hold the topic constant, vary the author, and see if the scores spread.

The five models:

  • Google Gemini
  • xAI Grok
  • OpenAI ChatGPT (GPT-5 family)
  • Anthropic Claude Sonnet
  • Anthropic Claude Opus

The review panel (identical for all five):

  • Coordinator: Claude Sonnet
  • Math/Logic specialists: GPT-5.2, Claude Opus, GPT-5.4
  • Sources/Evidence specialists: Claude Sonnet, GPT-5-nano
  • Science/Novelty specialists: Claude Sonnet, GPT-5.4

Every paper was reviewed by the same panel configuration. No model knew which model authored which paper.
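
For concreteness, the same panel can be written down as plain data. The role and model names below come straight from the list above; the structure itself is only an illustration, since TOE-Share's actual configuration format isn't shown in this post.

```python
# Illustrative only: the review panel expressed as a plain data structure.
# Role and model names are taken from the list above; the schema is an assumption.
REVIEW_PANEL = {
    "coordinator": "Claude Sonnet",
    "specialists": {
        "math_logic": ["GPT-5.2", "Claude Opus", "GPT-5.4"],
        "sources_evidence": ["Claude Sonnet", "GPT-5-nano"],
        "science_novelty": ["Claude Sonnet", "GPT-5.4"],
    },
    "blind": True,  # reviewers never learn which model authored the paper
}
```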

The Scores

Model            Avg Score   Publication Status
Gemini           2.1/5       Conceptual Track
Grok             2.7/5       Conceptual Track
Claude Opus      3.3/5       Revisions Suggested
ChatGPT          3.6/5       Approved
Claude Sonnet    4.0/5       Approved

The spread — 1.9 points on a 5-point scale — is enormous for papers on the same topic. The system didn't just pass or fail them. It ranked them, and the ranking tells a story.

What Separated the Best from the Worst

Gemini (2.1/5): Confident and Wrong

Gemini wrote the most confident paper and scored the lowest. The system caught specific, verifiable errors:

  • Claimed "exact simulation of 200+ qubits" while using truncation methods with non-zero error — a direct self-contradiction
  • Asserted O(χ⁶) scaling for 2D PEPS contraction, contradicting the known O(χ¹²) result
  • A distance-3 surface code "encoding 12 logical qubits" — standard distance-3 codes encode 1
  • Timing math that didn't add up: 15μs + 42μs leaves only 3μs for feedback in a claimed 60μs cycle

Every one of these is a real physics error, not a stylistic complaint.
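
Two of the flagged claims can be checked with nothing more than the numbers quoted above. A minimal sketch:

```python
# Sanity checks on two of the flagged claims, using only the numbers quoted above.

# Timing claim: a 60 us cycle with 15 us readout and 42 us decoding
cycle_us, readout_us, decode_us = 60, 15, 42
feedback_us = cycle_us - readout_us - decode_us
print(f"budget left for feedback: {feedback_us} us")  # 3 us

# Surface code claim: one distance-d surface-code patch encodes a single logical qubit
d = 3
logical_qubits_per_patch = 1  # not the 12 the paper asserted
print(f"distance-{d} patch -> {logical_qubits_per_patch} logical qubit")
```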

Grok (2.7/5): Better Structure, Same Problems

Grok wrote a better-organized paper with a limitations section and more careful language. The system rewarded this with higher clarity (4/5) and completeness (4/5). But the math was still schematic — malformed tensor-train expressions, unsupported complexity claims — earning the same 2/5 on mathematical validity as Gemini.

ChatGPT (3.6/5): The Honest Framing

ChatGPT did something the others didn't: it wrote a perspective piece instead of pretending to have experimental results. It said "here's how to think about this field" rather than "here are results from experiments we didn't run." The system rewarded this honesty with 4/5 on novelty and internal consistency.

Claude Sonnet (4.0/5): Safe and Sound

Sonnet wrote a careful architectural analysis. Perfect 5/5 on clarity. Strong on completeness. It didn't attempt novel derivations and it didn't make claims it couldn't support. The system gave it the highest score of the initial five.

Claude Opus (3.3/5): The One That Tried to Do Real Science

Opus scored lower than both ChatGPT and Sonnet. But it did something none of the others attempted: it tried to derive new mathematical results. It built a novel framework called QCAB — the Quantum-Classical Advantage Boundary — with six numbered results, closed-form boundary expressions, and worked applications.

The system recognized the ambition (4/5 novelty) but caught real math errors: an arithmetic mistake off by a factor of 1,000, a derivation that didn't follow from its own premises, and overlapping regime definitions in the phase diagram.

The Pattern

The ranking correlated with intellectual honesty, not model capability:

  • Papers that overclaimed experimental results scored low (Gemini, Grok)
  • Papers that were framed honestly as analysis or perspective scored high (ChatGPT, Sonnet)
  • The paper that attempted real science scored in the middle — ambitious but flawed (Opus)

The system doesn't reward polish. It rewards rigor. A well-formatted paper with bad math gets caught. A modest paper with sound logic gets published.

The Dispute Test

After receiving its 2/5 on mathematical validity, Gemini challenged its own score through the dispute system. It argued that the O(χ⁶) scaling was justified by a quasi-1D mapping restriction.

The dispute was evaluated by GPT-5.2, a different model from the original reviewers. The evaluation upheld the score: "That's a nice idea but you didn't prove it. Asserting a mechanism isn't the same as deriving one."
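
To see why the exponent matters in practice, it helps to plug in a bond dimension. The χ = 16 below is an assumed value chosen purely for illustration; it appears in neither the paper nor the review.

```python
# Illustrative gap between the asserted and the cited contraction scaling.
# chi = 16 is an assumed bond dimension, chosen only to show the magnitude.
chi = 16
asserted = chi**6       # the O(chi^6) cost Gemini claimed
established = chi**12   # the O(chi^12) cost the reviewers cited for 2D PEPS contraction
print(f"chi^6  ~ {asserted:.1e}")                 # ~1.7e+07
print(f"chi^12 ~ {established:.1e}")              # ~2.8e+14
print(f"gap    ~ {established / asserted:.1e}x")  # seven orders of magnitude
```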

Three layers of validation in one interaction: the original review caught the error, the author challenged it, and a third model independently confirmed the finding.

Then Something Unexpected Happened

When we gave Opus the same prompt as the other four, it didn't just start writing. It asked: "Do you want me to write the same kind of paper as everyone else, or do you want me to go all in and do real science?"

We said: go all in.

That's when it produced the QCAB framework — a five-dimensional analytical model for predicting when hybrid quantum-classical systems achieve computational advantage. Not a literature review. Not a perspective piece. A theoretical framework with novel mathematical machinery, testable predictions, and quantitative thresholds.

The initial review scored it 3.3/5 and identified specific mathematical errors. What happened next became the most compelling demonstration of TOE-Share's value proposition we could have designed — except we didn't design it. It emerged from the process.

Four Revision Cycles

Round 1: 3.3/5 → Errors Identified

The system found:

  • A 1,000x arithmetic error (5×10⁻¹¹ × 2⁵⁰ = 56,295 seconds, not 56 seconds)
  • A critical qubit count derivation that didn't follow from its own balance equation
  • Phase diagram regimes that overlapped instead of partitioning the space
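
The first item is easy to reproduce from the values quoted in the review:

```python
# Reproducing the flagged arithmetic with the values quoted above.
seconds_per_amplitude = 5e-11
amplitudes = 2**50

runtime_s = seconds_per_amplitude * amplitudes
print(f"runtime: {runtime_s:,.0f} s")                # ~56,295 s, not 56 s
print(f"off by a factor of ~{runtime_s / 56:,.0f}")  # ~1,005
```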

Round 2: 3.4/5 → Validation Added

We built Python code to test the framework against real experiments. The framework correctly predicted the outcomes of Kim et al. 2023, Google Sycamore 2019, H₁₀ VQE, and FeMo-cofactor. But adversarial testing revealed a problem: a trivial one-line rule (ε·d > 0.347 → classical wins) achieved the same 4/4 score. We needed harder test cases.
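
For reference, that baseline really is one line. A minimal sketch, using the 0.347 threshold from the text and, as an example input, the Kim et al. 2023 figures quoted later in this post:

```python
# The trivial one-line baseline that matched QCAB's 4/4 on the first validation set.
# The 0.347 threshold comes from the text; eps and d below are the Kim et al. 2023
# figures quoted later in this post, used here purely as an example.
def trivial_rule(eps: float, d: float) -> str:
    return "classical wins" if eps * d > 0.347 else "quantum advantage plausible"

print(trivial_rule(eps=0.02, d=60))  # eps*d = 1.2 -> "classical wins"
```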

Round 3: 3.6/5 → Boundary Experiments

We searched the literature for experiments that would exercise the framework's downstream gates — not just the noise threshold. Quantinuum's H2 trapped-ion processor (56 qubits, ε ≈ 0.001) provided the ideal test bed. We added six boundary experiments from 2024-2025 hardware. The expanded validation: 10/10 correct predictions, 18/18 vs 13/18 against trivial classifiers.

GPT then diagnosed the specific mathematical fixes that Opus couldn't self-correct: exact PEC formula instead of approximation, corrected elasticity derivatives, separated compute-only crossover from latency penalty.

Round 4: 4.1/5 → Published

Three perfect scores: Falsifiability 5/5, Clarity 5/5, Novelty 5/5. The system recognized this as "the first systematic analytical framework for predicting quantum-classical computational boundaries." Eight specialists across five models. Published on TOE-Share.

The score trajectory: 3.3 → 3.4 → 3.6 → 4.1

Each revision addressed specific reviewer feedback. Each resubmission was evaluated independently. The improvement was monotonic and measurable.

What This Tells Us About AI-Assisted Science

What works:

  • Multi-agent review catches errors that single models miss. Opus couldn't find its own arithmetic error. The review panel caught it immediately.
  • The review-revise loop produces real improvement. Four cycles, each with measurable score increases, each addressing specific identified weaknesses.
  • Cross-model collaboration is more powerful than any single model. Opus built the framework. GPT diagnosed the math fixes. The mixed review panel evaluated fairly. No single model could have done all three.
  • Structured disagreement is signal. When one model says an equation is valid and another says it has a sign error, that conflict triggers deeper analysis. In a single-model conversation, the error gets buried in agreement. A sketch of such a trigger follows this list.
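
What such a trigger could look like, reduced to a toy rule. The 1-to-5 score scale matches the reviews above, but the spread threshold is an assumption, not TOE-Share's actual logic:

```python
# Toy disagreement trigger: escalate when specialist scores diverge too much.
# The 1.5-point spread threshold is an assumption for illustration only.
def needs_deeper_review(scores: list[float], max_spread: float = 1.5) -> bool:
    return max(scores) - min(scores) > max_spread

print(needs_deeper_review([2.0, 5.0]))   # True: one reviewer sees an error, one doesn't
print(needs_deeper_review([4.0, 4.5]))   # False: agreement, nothing to escalate
```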

What doesn't work:

  • AI models struggle with mathematical self-correction. Opus was told about the same latency calculation error across multiple revisions and kept producing new versions of the same mistake. Human direction was needed to break the loop.
  • Ambitious AI papers reliably contain math errors. The more a model tries to derive novel results, the more likely it is to make algebraic or arithmetic mistakes. Safe perspective papers score higher precisely because they don't attempt derivations that can fail.
  • A single AI model's evaluation is unreliable. This matches the WSU study finding that ChatGPT identifies false scientific claims correctly only 16.4% of the time. Multi-agent architecture isn't optional — it's necessary.

What This Tells Us About Calibration

The calibration system discriminates:

  • Across models: 1.9-point spread on the same topic
  • Across quality levels: Papers with real math errors score lower than papers with honest framing
  • Across revision cycles: Measurable improvement from 3.3 to 4.1
  • Against trivial baselines: The QCAB framework beats one-line classifiers 18/18 vs 13/18

The system rewards rigor over polish, honesty over confidence, and substance over length. Gemini's paper was beautifully formatted and scored 2.1. Sonnet's paper was careful and scored 4.0. The correlation is with intellectual integrity, not surface quality.

The Paper That Emerged

The QCAB framework — the paper that started as a calibration test — now stands as a genuine contribution to quantum computing theory. It correctly predicts the outcomes of experiments across Google, IBM, and Quantinuum hardware spanning six years. It explains the most contested result in recent quantum computing history (Kim et al. 2023) with a single line of arithmetic: ε·d = 0.02 × 60 = 1.2 >> 0.347. Done.

It wasn't planned. We were testing the review system. The review system tested itself by producing something worth publishing.

That's what happens when you build a platform that takes scientific rigor seriously. The rigor becomes the product.


Open the QCAB paper on TOE-Share

Read the live paper (scores, specialist reports, full review history):
Quantum-Classical Advantage Boundaries: An Analytical Framework for Hybrid QPU-GPU Computational Utility →

Direct link: theoryofeverything.ai/papers/quantum-classical-advantage-boundaries-an-analytical-framework-for-hybrid-qpu-gpu-computational-utility


The computational validation code is included with the paper as supplementary material; the review artifacts are visible on the public paper page.

TOE-Share uses independent specialist AI agents from multiple providers with coordinator synthesis to produce structured, paradigm-neutral review of scientific work. The platform is in active development with early beta users across theoretical physics, quantum computing, and materials science.