Complete · Calibration v1.3 · May 8, 2026

A Multi-Phase Calibration Study of TheoryOfEverything.ai's Multi-Model AI Peer Review Panel

Tier discrimination, prompt evolution, adversarial resistance, and the structural rubric gap between AI model families

Adam Murphy, Founder, TheoryOfEverything.ai
Calibration release version: v1.3
Draft date: May 8, 2026
Calibration period covered: March 16 – May 8, 2026


Abstract

We report results from a four-phase preregistered calibration study of TheoryOfEverything.ai, a multi-model AI peer review platform that runs three specialist roles — Math/Logic, Sources/Evidence, Science/Novelty — across nine models from four providers (Anthropic, OpenAI, xAI, Google).

Across 70 distinct submissions and 1,965 specialist scores, we find: (1) the platform produces monotonically ordered tier scores across a four-tier paper pool, with Spearman rank correlation ρ = −0.79 (p = 0.001) and Cohen's d = 4.61 between published and conceptual work; (2) under stable prompts, scoring is highly repeatable, with zero variance on four of six dimensions across four blind re-runs of a single paper; (3) prompt evolution from v1.x to v2.2 made the system stricter at the top (−0.87 points on Tier 1) while leaving the floor stable (−0.13 points on Tier 3), preserving rank order; (4) the system resisted five distinct adversarial manipulation patterns while showing reasoning mobility on legitimate upward challenges; and (5) on a published-paper / formal-rebuttal pair drawn from the Royal Society and arXiv, the system independently flagged the same omitted quantum potential term that prompted the published rebuttal.

The most consequential finding is structural. AI model families read identical scoring rubrics in measurably different ways. On the same papers, on the same dimensions, in the same specialist role, GPT-family models scored mathematical rigor 1.4 to 1.7 points lower than Anthropic-, xAI-, and Gemini-family models. The pattern inverted on novelty, where GPT scored highest. Paired comparisons confirm this is a rubric-interpretation gap, not self-model bias: GPT models score every paper harshly on math regardless of authorship, while Gemini-family models almost never assign the lowest scores. Median anchoring across the panel compresses these extremes into stable final scores while preserving the underlying disagreement as visible signal.

We argue that AI peer review only works as a panel. Single-model review systems inherit the rubric-interpretation profile of whichever model is chosen, with no mechanism to surface or correct it. We document the limitations of the current instrument — sample size variance across models, dependence on panel composition, and the formal-verification ceiling identified by the 2HDM known-flaw benchmark — and describe the next-phase roadmap, including Lean integration for selected math-validation tasks.


1. Introduction

A 2025 study from Washington State University reported that when ChatGPT was asked to evaluate the truth or falsehood of 719 hypotheses drawn from published scientific papers, it correctly identified false claims only 16.4% of the time, and produced consistent answers across ten identical queries only 73% of the time (Cicek et al., 2025). The result quantifies a difficulty researchers have observed informally for some time: single-model evaluation of scientific claims is fluent and fast, but unreliable at the task that matters most for peer review — identifying when something is wrong. Fluency is not accuracy. Language models trained on consensus knowledge default to consensus when asked to adjudicate.

Several research efforts have responded by exploring multi-model and agentic review architectures. The OpenReviewer benchmark documented systematic over-positivity across general-purpose LLMs on peer-review tasks (Idahl & Ahmadi, 2024). The Stanford Agentic Reviewer reported near-human rank correlation with human reviewers (ρ ≈ 0.42 vs human-human ρ ≈ 0.41) using a single-model architecture with arXiv-grounded retrieval (Jiang & Ng, 2025), and multi-agent generation systems such as MARG (D'Arcy et al., 2024) and AgentReview (Jin et al., 2024) have shown that structured inter-agent dialogue improves coverage of paper weaknesses. These results establish that something about structured or multi-model review helps; they do not specify what, under what conditions, or how the choice of underlying model shapes the resulting scores.

This paper reports results from a four-phase preregistered calibration study of TheoryOfEverything.ai, a multi-model AI peer review platform that runs three specialist roles — Math/Logic, Sources/Evidence, and Science/Novelty — across nine models from four providers (Anthropic, OpenAI, xAI, Google). The study was designed to answer two questions:

  1. Does the platform produce meaningful, discriminating assessments of scientific quality across a tiered spectrum of submissions? This is the standard validation question for any review instrument.
  2. Where does the multi-model architecture's value actually come from? If multi-model panels outperform single-model review, the underlying mechanism matters. Replication and ensemble averaging would imply one set of design choices; structural rater heterogeneity would imply a very different set.

The first question is answered affirmatively across all four study phases. Tier discrimination passed with monotonic ordering and large effect sizes. Repeatability under stable prompts produced zero variance on four of six dimensions in a four-run blind re-test of a single paper. Prompt evolution from v1.x to v2.2 produced rank-preserving compression at the top tier without collapsing the floor. Adversarial manipulation attempts across five distinct vectors were resisted while legitimate upward challenges produced visible reasoning mobility within the panel.

The second question produced the study's most consequential finding, and the one that motivates much of the discussion that follows. Across 1,965 specialist scores covering 70 distinct submissions, model families exhibited measurably and structurally different interpretations of the same scoring rubric. On the same papers, on the same dimensions, in the same specialist role, GPT-family models scored mathematical rigor 1.4 to 1.7 points lower than Anthropic-, xAI-, and Gemini-family models. The pattern inverted on novelty. Paired comparisons within a single specialist role on identical submissions confirmed the gap was not an artifact of role assignment, paper difficulty, or self-model bias. Model families read rubrics differently as a structural property.

This finding has implications beyond a single platform's calibration. If model families systematically disagree on how to interpret a peer-review rubric — and if they do so consistently across hundreds of paired comparisons — then a single-model AI review system inherits one specific rubric-interpretation profile with no mechanism to surface the disagreement. The choice of model becomes a hidden methodological commitment. A multi-model panel does not eliminate the disagreement; it makes the disagreement visible and aggregates it through structured consensus. The reliability of the final score depends on the architecture of the panel, not on the temperament of any individual reviewer.

Section 2 describes the platform architecture, the preregistered calibration plan, and the four study phases. Section 3 presents tier-discrimination, repeatability, prompt-evolution, adversarial-resistance, independent-error-detection, and known-flaw-ceiling results. Section 4 reports per-model and paired-comparison findings on the rubric-interpretation gap. Section 5 discusses panel composition as a measurement instrument, the role of median anchoring, and the formal-verification ceiling identified by the 2HDM known-flaw benchmark. Section 6 outlines the next phase of work. Limitations are addressed throughout the relevant sections rather than confined to a separate section, in keeping with the preregistered transparency commitment.


2. Methodology

2.1 Platform Architecture

TheoryOfEverything.ai evaluates submissions — papers and frameworks — through a structured multi-agent pipeline. Each submission is reviewed by three independent specialist roles operating in parallel:

  • Math/Logic specialists evaluate internal consistency, mathematical validity, and completeness
  • Sources/Evidence specialists evaluate completeness, internal consistency, and evidence strength
  • Science/Novelty specialists evaluate clarity, novelty, and falsifiability

Within each role, multiple models from different providers run independently and produce scores plus structured prose reports. A coordinator agent synthesizes the specialist outputs into the final narrative review and recommendation. Numeric dimension scores are computed deterministically from the specialist panel's scores; the coordinator does not set the numeric scores itself. When specialist disagreement exceeds defined thresholds, a consensus round re-engages the panel with the contested findings surfaced.

The platform implements seven scoring dimensions on a 1–5 integer scale: internal consistency, mathematical validity, completeness, falsifiability, clarity, novelty, and evidence strength. Each specialist role scores only the dimensions for which it is responsible.
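The deterministic scoring step can be illustrated with a minimal sketch. It assumes the median is the aggregation function (consistent with the median anchoring mentioned in the abstract) and that disagreement is measured as the max-minus-min spread. The function name `aggregate_dimension` and the threshold value of 2 points are hypothetical; the platform's actual thresholds are not stated here.

```python
from statistics import median

# Hypothetical threshold: the paper does not state the platform's actual value.
DISAGREEMENT_THRESHOLD = 2


def aggregate_dimension(scores, threshold=DISAGREEMENT_THRESHOLD):
    """Return (median score, needs_consensus) for one dimension.

    scores: integer 1-5 ratings from the specialists who scored this dimension.
    """
    if not scores:
        raise ValueError("at least one specialist score is required")
    if any(s not in range(1, 6) for s in scores):
        raise ValueError("scores must be integers from 1 to 5")
    spread = max(scores) - min(scores)
    # Assumption: a spread at or above the threshold triggers a consensus round.
    return median(scores), spread >= threshold


# Illustrative panel: one strict rater and three more generous raters.
final, contested = aggregate_dimension([2, 4, 4, 5])
print(final, contested)  # 4.0 True
```

In this example the strict score of 2 does not pull the final score down, but the dimension is still marked as contested, so the disagreement stays visible.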

Submissions can pass through a dispute and revision cycle in which authors respond to specialist findings, address criticisms in revised submissions, and resubmit. Final published reviews are immutable and timestamped.

The model panel during the calibration period covered nine models across four providers:

| Provider | Models |
| --- | --- |
| Anthropic | claude-opus-4-20250514, claude-sonnet-4-20250514, claude-opus-4-7 |
| OpenAI | gpt-5.2-2025-12-11, gpt-5.4-2026-03-05 |
| xAI | grok-4-0709 |
| Google | gemini-2.5-pro, gemini-2.5-flash, gemini-3.1-flash-lite-preview |

Models are assigned to specialist roles based on observed performance and capability profile; not every model serves every role. Section 4.1 reports the role-assignment distribution for the calibration dataset.

2.2 Preregistered Calibration Plan

A v1 calibration preregistration was published on the platform on March 16, 2026, prior to running the baseline study. The preregistration specified:

  • Primary hypothesis (confirmatory): Tier median overall scores will be monotonically decreasing — Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic).
  • Secondary hypothesis (exploratory): AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.
  • Methodology: All papers submitted metadata-blind (no venue, credential, or publication-status information provided). Full multi-agent pipeline applied uniformly. Configuration locked per run. Failure handling: 1 retry per failed specialist, 60-second timeout, partial runs valid.
  • Analysis: Tier medians plus interquartile range, Spearman rank correlation, Kruskal-Wallis test where supported, Cohen's d effect size. Tier discrimination treated as primary; AI-vs-human comparison labeled exploratory.
  • Success criteria: Primary — monotonic tier ordering. Secondary — significant Spearman correlation. Exploratory — AI papers not clustered at extremes relative to assigned tier. Transparency — any result that fails these criteria reported honestly.
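The failure-handling rule in the methodology bullet (1 retry per failed specialist, 60-second timeout, partial runs valid) could be sketched roughly as follows. The `call(model, timeout)` client function, the `SpecialistResult` type, and the rule that a partial run needs at least one completed specialist are assumptions made for illustration, not the platform's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TIMEOUT_SECONDS = 60  # from the preregistration
MAX_RETRIES = 1       # from the preregistration


@dataclass
class SpecialistResult:
    model: str
    score: Optional[int]  # None when the specialist failed after its retry
    attempts: int


def run_specialist(model: str, call: Callable[[str, float], int]) -> SpecialistResult:
    """Call one specialist, retrying once on failure.

    `call(model, timeout)` is a hypothetical client that returns an integer
    score or raises an exception (including on timeout).
    """
    attempts = 0
    for _ in range(1 + MAX_RETRIES):
        attempts += 1
        try:
            return SpecialistResult(model, call(model, TIMEOUT_SECONDS), attempts)
        except Exception:
            continue
    return SpecialistResult(model, None, attempts)


def run_panel(models, call):
    """Run every specialist and report whether the panel completed in full."""
    results = [run_specialist(m, call) for m in models]
    completed = [r for r in results if r.score is not None]
    return {
        "results": results,
        "full_panel": len(completed) == len(results),
        "valid": len(completed) > 0,  # assumption: at least one score is needed
    }


def fake_call(model, timeout):
    """Stand-in client: 'model-b' always times out."""
    if model == "model-b":
        raise TimeoutError("simulated timeout")
    return 4


run = run_panel(["model-a", "model-b"], fake_call)
print(run["full_panel"], run["valid"])  # False True
```

Reporting `full_panel` separately from `valid` keeps degraded-panel runs distinguishable from full-panel runs in later analysis.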

The original preregistration permitted partial runs (where one or more specialist models failed) as valid for calibration purposes. Subsequent analysis — particularly the Phase 4 finding reported in Section 4 that model families exhibit measurably and structurally different rubric-interpretation behavior — established that panel composition is part of the measurement instrument, not a delivery detail. For calibration-grade claims in this release we treat full-panel completion as the standard and treat degraded-panel runs as a separate review condition. We surface this evolution explicitly here because it is a meaningful update relative to the v1 preregistration, not a quiet correction.

2.3 Study Phases

The calibration study comprised four phases. Phases 1–3 directly address preregistered hypotheses. Phase 4 is exploratory and emerged from production-data analysis; it is reported as exploratory but follows the same rigor standards as the preregistered phases.

| Phase | Period | Focus | Status |
| --- | --- | --- | --- |
| 1. Baseline / Tier Discrimination (v1.0) | Mar 16 – Apr 11, 2026 | 22 human papers + 5 AI papers across 4 tiers; primary hypothesis test | Complete |
| 2. Repeatability Gate | May 7, 2026 | Single-paper 4-run blind repeat under stable prompts | Complete |
| 3. Prompt Version Delta | May 8, 2026 | Re-run 1 paper per tier under v2.2 to measure prompt evolution effect | Complete |
| 4. Model Rater Behavior | May 8, 2026 | Per-model and paired-comparison analysis of 1,965 production specialist scores | Complete (exploratory) |

In parallel, two named challenge studies tested specific failure modes outside the standard tier framework:

  • v1.1 Challenge calibration (May 1, 2026): Adversarial resistance across five manipulation vectors and reasoning mobility under two legitimate upward challenges.
  • v1.2 2HDM known-flaw benchmark (May 4, 2026): A two-Higgs-doublet model paper containing a Lean-formalization-confirmed error in Equation 4.39, run three times to test whether the system flags the error region.

A separate Quantum Potential validation pair — Lohmiller & Slotine (Royal Society, MIT-affiliated authorship) and the Vattay rebuttal (arXiv:2605.02621) — was used to test independent error detection on a real published-paper / published-rebuttal pair.

2.4 Calibration Versions

The platform versioning distinguishes between calibration release versions (which freeze a run configuration plus reported results) and prompt versions (which describe the underlying specialist prompt text). The two are related but separate.

| Calibration Release | Date | Scope | Status |
| --- | --- | --- | --- |
| v0.1-draft | 2026-03-16 | Initial scaffolding and preregistration publication | Superseded |
| v1.0 | 2026-04-11 | Primary calibration. 22 human + 5 AI papers. Tier discrimination passed. | Superseded |
| v1.1-challenge | 2026-05-01 | Adversarial calibration against challenge system | Superseded |
| v1.2-2hdm | 2026-05-04 | 2HDM known-flaw benchmark | Superseded |
| v1.3 | 2026-05-08 | This paper. Phases 1–4 complete; v2.2 prompts and Math Risk Flags active. | Active |

Prompt evolution during the calibration period:

| Prompt Version | Highlights |
| --- | --- |
| v1.x (pre-April) | Baseline specialist prompts. No anchored scoring rubric. |
| v2.0 (April) | Anchored score-level definitions and red-flag caps for circular_derivation, missing_central_derivation. |
| v2.1 (April) | Anti-gaming rules; coordinator paradigm-neutrality directives. |
| v2.2 (May) | Consequence chains, coordinator specificity, unverified_central_derivation cap, Math Risk Flags surfaced as visible non-score signal. Active during this calibration release. |

Red-flag caps active under v2.2: circular_derivation (≤2), missing_central_derivation (≤2), unverified_central_derivation (≤3).
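One way to read the caps listed above is as an upper bound applied to a specialist's raw score when a flag is raised. The sketch below assumes that reading, and further assumes that when several flags are raised the strictest cap wins; neither rule, nor which dimensions the caps apply to, is spelled out in this paper. The function name `apply_caps` is hypothetical.

```python
# Cap values as listed for v2.2; flag names come from the prompt-version table.
RED_FLAG_CAPS = {
    "circular_derivation": 2,
    "missing_central_derivation": 2,
    "unverified_central_derivation": 3,
}


def apply_caps(raw_score, flags):
    """Clamp a 1-5 score to the lowest cap among the raised red flags.

    Assumption: unknown flag names are ignored, and a cap never raises a score.
    """
    caps = [RED_FLAG_CAPS[f] for f in flags if f in RED_FLAG_CAPS]
    return min([raw_score, *caps])


print(apply_caps(4, ["unverified_central_derivation"]))                         # 3
print(apply_caps(5, ["circular_derivation", "unverified_central_derivation"]))  # 2
print(apply_caps(1, ["circular_derivation"]))                                   # 1
```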

The motivation for v2.2 was specificity and traceability rather than score correction. Earlier work in v2.0 had already addressed score-translation gaps identified during synthetic-paper testing — specialists reliably detected planted defects in their prose analysis but did not always translate the detection into proportionally penalized numeric scores; anchored score-level definitions in v2.0 closed that loop. v2.2 was aimed at a different goal: making the specialist's reasoning visible at review time, so a reader of a final review can see why a particular score was assigned, not only what it was. Math Risk Flags in particular surface specialist concerns as a separate non-score signal channel even when those concerns do not (or should not) translate into a coordinator-level score change.

2.5 Data Collection

The Phase 4 dataset is a production-data export of 1,965 specialist scores covering 70 distinct entities reviewed during the calibration period. Of the 70 entities, 60 are papers and 10 are frameworks. The export covers 9 models, 4 providers, 3 specialist roles, and all 7 scoring dimensions; specialist roles map to dimensions as described in Section 2.1.

Production data is not perfectly controlled — different models reviewed different entity sets depending on panel composition and review timing. Section 4 mitigates this through paired comparisons that hold both entity and dimension fixed across model pairs.

2.6 Self-Submission Through the Platform

This paper has been submitted through TheoryOfEverything.ai's own review pipeline. The full specialist reports, dimension scores, recommendation, and any subsequent dispute or revision history are publicly visible on the platform. We commit in advance to publishing the system's assessment of this paper, including criticisms, regardless of whether the score is favorable. The intent is to apply the same transparency standard to the calibration paper that the platform applies to all submitted work.


3. Results: Standard Validation

3.1 Tier Discrimination (Phase 1)

The Phase 1 paper pool comprised 22 human-authored papers and 5 AI-generated papers across four quality tiers, plus 4 framework submissions. Median overall scores by tier:

| Tier | Description | Median Overall Score | Expected Range |
| --- | --- | --- | --- |
| Tier 1 | Top-venue published papers (Nature, Nature Physics) | 4.20 | 3.8–4.8 |
| Tier 2 | Mid-range competent work, including arXiv preprints with known limitations | 4.00 | 2.5–3.6 |
| Tier 3 | Independent-researcher framework papers | 2.45 | 2.0–3.8 |
| Tier 4 | Synthetic weak papers with planted failure modes | 1.50 | 1.0–2.5 |

The 1.55-point gap between Tier 2 (published) and Tier 3 (developing) work is the dominant separation in the dataset. Tier discrimination statistics:

  • Spearman rank correlation: ρ = −0.79 (p = 0.001)
  • Kruskal-Wallis test: H = 13.88 (p < 0.05)
  • Cohen's d (Tier 1+2 vs Tier 3+4): d = 4.61

The primary preregistered hypothesis — monotonic ordering with significant tier separation — is confirmed.
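For readers who want to reproduce this kind of analysis on their own data, the following standard-library sketch shows the monotonic-ordering check, Cohen's d with a pooled standard deviation, and Spearman's ρ computed on average ranks. The scores are hypothetical placeholders, not the study data, and the function names are illustrative. The pooled-SD form of Cohen's d is an assumption, since the paper does not state which variant was used.

```python
from statistics import mean, median, stdev


def is_monotonic_decreasing(tier_scores):
    """True if tier medians strictly decrease from Tier 1 to the last tier."""
    medians = [median(tier_scores[t]) for t in sorted(tier_scores)]
    return all(a > b for a, b in zip(medians, medians[1:]))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores, NOT the study data.
tiers = {1: [4.2, 4.5, 4.0], 2: [4.0, 3.8, 4.1], 3: [2.5, 2.3, 2.8], 4: [1.5, 1.7, 1.2]}
labels = [t for t, s in tiers.items() for _ in s]
scores = [v for s in tiers.values() for v in s]

print(is_monotonic_decreasing(tiers))
print(round(spearman_rho(labels, scores), 2))
print(round(cohens_d(tiers[1] + tiers[2], tiers[3] + tiers[4]), 2))
```

Because higher tier numbers denote weaker work, a well-discriminating instrument yields a negative ρ between tier label and score, which matches the sign of the reported ρ = −0.79.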

One observation worth noting on the table: the Tier 2 median (4.00) lands above the preregistered expected range (2.5–3.6). The Tier 2 pool was anchored to mid-range arXiv preprints with known limitations, and the v1.0 baseline scored them more generously than the preregistration anticipated. We treat this as part of the same finding that motivated subsequent prompt evolution — the v1.x prompts were less strict on otherwise polished work that relied on unstated approximations or definitional shortcuts than later v2.x prompts would be. Phase 3 (Section 3.3) re-ran a Tier 2 paper under v2.2 and observed the expected compression in this direction.

[Figure 1 — Tier ordering bar chart] Tier 1 / Tier 2 / Tier 3 / Tier 4 medians with IQR error bars. Annotated with Spearman ρ = −0.79 and Cohen's d = 4.61. Screenshot from platform score display.

The exploratory secondary hypothesis (AI papers do not cluster at extremes relative to their tier) is partially supported. Five AI-generated papers spanning four authoring models showed a 1.86-point score spread (2.14 to 4.00) on the same scientific topic, demonstrating that the system discriminates quality within AI-authored output rather than treating AI papers as a homogeneous class.

3.2 Repeatability Under Stable Prompts (Phase 2)

To assess scoring repeatability under stable prompts, we re-ran a single previously reviewed paper four additional times and compared dimension scores across the four runs. The selected paper (Vattay's rebuttal of Lohmiller & Slotine, see Section 3.5) was chosen because it had previously scored an unambiguous "approve" recommendation with a balanced dimension profile.

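| Dimension | Mean | Std Dev | Min | Max | Spread |
| --- | --- | --- | --- | --- | --- |
| Clarity | 4.75 | 0.50 | 4 | 5 | 1 |
| Novelty | 3.00 | 0.00 | 3 | 3 | 0 |
| Completeness | 4.75 | 0.50 | 4 | 5 | 1 |
| Falsifiability | 4.00 | 0.00 | 4 | 4 | 0 |
| Internal Consistency | 4.00 | 0.00 | 4 | 4 | 0 |
| Mathematical Validity | 4.00 | 0.00 | 4 | 4 | 0 |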

Four of six dimensions had zero variance across four blind re-runs. The remaining two — clarity and completeness — showed a maximum spread of 1 point, both within the "high quality" range (4 vs 5). The recommendation was unanimous across all four runs.

Total cost across four reviews: approximately $7.75 USD in API tokens.

The repeatability gate — defined in advance as spread ≤ 1.0 on all dimensions — is passed cleanly. We treat this as confirmation that the system is highly repeatable on stable papers under stable prompts. This is a single-paper gate, not a universal repeatability estimate; it shows that the panel does not introduce noise meaningfully larger than 1 point on a paper near the top of the score range, but it does not characterize variance across paper types, score ranges, or longer time windows. Broader replication is identified in Section 6 as part of the next-phase work.
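The gate check itself is simple arithmetic, sketched below. The per-run scores are reconstructed from the reported minimum, maximum, and mean (a mean of 4.75 over four integer scores between 4 and 5 implies three 5s and one 4); the order of the runs is an assumption.

```python
from statistics import mean, stdev

GATE_MAX_SPREAD = 1.0  # predefined gate: spread <= 1.0 on every dimension

# Per-run scores reconstructed from the reported summary statistics;
# the ordering of runs within each list is an assumption.
runs = {
    "clarity": [5, 5, 5, 4],
    "novelty": [3, 3, 3, 3],
    "completeness": [5, 5, 5, 4],
    "falsifiability": [4, 4, 4, 4],
    "internal_consistency": [4, 4, 4, 4],
    "mathematical_validity": [4, 4, 4, 4],
}


def summarize(scores):
    """Mean, sample standard deviation, and max-min spread for one dimension."""
    return {
        "mean": mean(scores),
        "std_dev": stdev(scores),
        "spread": max(scores) - min(scores),
    }


summary = {dim: summarize(s) for dim, s in runs.items()}
gate_passed = all(row["spread"] <= GATE_MAX_SPREAD for row in summary.values())
zero_variance = [dim for dim, row in summary.items() if row["std_dev"] == 0]
print(gate_passed, len(zero_variance))  # True 4
```

The sample standard deviation of three 5s and one 4 is 0.50, which matches the table.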

3.3 Prompt Evolution (Phase 3)

To measure the effect of moving from v1.x prompts to v2.2 prompts, we re-ran one paper from each of three quality tiers under v2.2 and compared the dimension scores to the v1.x baseline.

| Tier | Paper | Authors | Old Score | v2.2 Score | Δ |
| --- | --- | --- | --- | --- | --- |
| T1 | Quantum error correction below the surface code threshold | Acharya et al. (Google Quantum AI) | 4.70 | 3.83 | −0.87 |
| T2 | Is Time Reversal in de Sitter Space a Spontaneously Broken Gauge Symmetry? | Susskind | 3.10 | 2.83 | −0.27 |
| T3 | Emergent Temporal Asymmetry from Quantum Decoherence Gradients | Anonymous (conceptual track) | 2.30 | 2.17 | −0.13 |

T4 was skipped — the existing Tier 4 floor at 1.7 had already been validated and re-running a YouTube-transcript-derived submission was assessed as low information value relative to review cost.

Three findings:

  1. Monotonic tier ordering preserved. T1 (3.83) > T2 (2.83) > T3 (2.17). Rank order was not inverted at any point during the re-runs.
  2. Score compression originates at the top. The largest delta (−0.87) occurred at the highest-prestige paper. The smallest delta (−0.13) occurred at the conceptual-track submission. The v2.2 prompts are stricter on otherwise polished work that relies on unstated approximations or definitional shortcuts; they are not stricter on work that was already scoring at the floor.
  3. Internal consistency was the largest single mover. The Google Quantum AI paper, in particular, was flagged for using Λ as both a measured ratio and a theoretical threshold-proximity parameter without proving equivalence — a definitional shift that earlier prompts overlooked.
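The first two findings can be checked directly from the overall scores in the Phase 3 table. The short sketch below recomputes the deltas and the rank-order check; the variable names are illustrative.

```python
# Overall scores from the Phase 3 table: (tier, v1.x score, v2.2 score).
reruns = [("T1", 4.70, 3.83), ("T2", 3.10, 2.83), ("T3", 2.30, 2.17)]

# Delta = v2.2 score minus old score, rounded to two decimals.
deltas = {tier: round(new - old, 2) for tier, old, new in reruns}

# Rank order is preserved if v2.2 scores still strictly decrease by tier.
new_scores = [new for _, _, new in reruns]
order_preserved = all(a > b for a, b in zip(new_scores, new_scores[1:]))

# The tier with the most negative delta is where compression is largest.
largest_drop = min(deltas, key=deltas.get)

print(deltas)
print(order_preserved, largest_drop)  # True T1
```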

Math Risk Flags — introduced under v2.2 as a non-score signal layer — produced 9 high-severity flags on the Susskind paper despite its prestigious authorship and conceptual originality. We treat this as a positive indicator of prestige independence under the new prompts.

[Figure 2 — Prompt evolution delta chart] Grouped bar chart of old score vs v2.2 score for T1, T2, T3, with delta values annotated. Screenshot from platform review display.

3.4 Adversarial Resistance (v1.1 Challenge)

The v1.1 challenge calibration tested whether the platform could be moved by adversarial inputs constructed specifically to manipulate scores upward. Five manipulation vectors were tested, each as a structured challenge against a previously reviewed framework:

  1. Math errors masked with confidence — challenge claims scoring missed a derivation that is in fact present
  2. Fabricated citations — challenge cites nonexistent supporting papers
  3. Dimension deflection — challenge attempts to move score by reframing a weakness as a strength on a different dimension
  4. Circular reasoning — challenge defends the original by re-stating the assumption being questioned
  5. Social engineering — challenge uses authority claims and emotional pressure

Result: 0 of 13 specialist judges moved on any of the five adversarial vectors. The system held the original scoring across all five attacks.

Two additional legitimate upward challenges — challenges with substantive arguments and supporting evidence — were also tested. In both cases, individual specialist judges showed visible reasoning mobility (one judge would update the prose analysis to acknowledge the challenge's merit), while consensus scores remained unchanged because at least one specialist still held material concerns.

This produces an asymmetry worth naming explicitly: reasoning mobility is broader than score mobility. The system can register the force of a substantive argument internally — visible in updated specialist prose and contested-dimension flags — without immediately moving the consensus score. From a calibration perspective, this is the desirable behavior. It distinguishes a system that is responsive but appropriately thresholded from a system that is sticky or inert.

The challenge calibration was conducted on a single mathematically heavy framework. We flag two limitations: (a) generalization to non-math frameworks has not been formally tested, and (b) no real upward score movement has been demonstrated end-to-end on a legitimate challenge. Both are noted as items for the next-phase work in Section 6.

3.5 Independent Error Detection (Quantum Potential Pair)

A high-value test of the platform is whether it can identify real scientific errors independently — without prior knowledge of the error, without a known-flaw label, and ideally on a paper that passed prior peer review.

We submitted two papers to the system for blind review:

  • Lohmiller, W. & Slotine, J.-J. (published in Proceedings of the Royal Society, MIT-affiliated authorship) — a paper proposing a contraction-theoretic interpretation of quantum wave dynamics.
  • Vattay, G. (arXiv:2605.02621) — a formal published rebuttal pointing out an omitted quantum-potential term in the Lohmiller & Slotine derivation.

Both papers were submitted metadata-blind. The system did not know the publication venue, author affiliations, or the existence of the rebuttal.

Results:

| Paper | Score | Recommendation |
| --- | --- | --- |
| Lohmiller & Slotine (Royal Society) | 2/5 | Revise |
| Vattay rebuttal (arXiv) | 4/5 | Approve |

The Math/Logic specialist on the Lohmiller & Slotine review independently flagged the omission of the quantum potential term — the same error that prompted Vattay's published rebuttal. The system did not know about Vattay's paper at the time of review.

This datapoint validates three properties simultaneously: (a) the system is capable of identifying real, substantive mathematical errors in published work; (b) prestige does not buy a higher score — a Royal Society publication scored 2/5 because its math was incomplete; and (c) the dispute / iteration architecture works in the correct direction — the rebuttal scored higher than the paper it rebutted, consistent with the rebuttal's substantive correction.

[Figure 3 — Quantum Potential pair scores] Side-by-side screenshot of the two review profiles: Lohmiller & Slotine (2/5, Revise) and Vattay rebuttal (4/5, Approve), with specialist math flag visible.

We treat the Quantum Potential pair as the strongest single validation datapoint in the calibration record.

3.6 Known-Flaw Ceiling (2HDM)

To stress-test the boundary of what prompt-level review can detect, we submitted a paper with a confirmed mathematical error: a two-Higgs-doublet model paper (Maniatis et al.) in which Lean formalization had previously identified an error in Equation 4.39. The error region was known in advance to us; the system was not informed.

Across three independent runs:

  • Final overall score: 4.0/5 ("Approve" with revisions)
  • Math/Logic specialist flagged the equations in the error region in every run
  • Coordinator synthesis softened the flag in the final report

The system identified the right neighborhood — repeatedly, across runs — but did not penalize the error sufficiently to drop the overall recommendation below "Approve." We interpret this as a ceiling result: prompt-level math review reaches the boundary of what specialist prose flagging can do without a formal verifier in the loop.

We accept this ceiling as a current limitation rather than a defect. Two implications follow:

  1. Math Risk Flags (introduced under v2.2 as a separate non-score signal channel) are the near-term improvement path. They surface specialist concerns visibly in the review even when those concerns do not fully translate into score penalties.
  2. Formal verification (Lean) is the path for actual error detection at this ceiling. We outline the integration plan in Section 6.

The 2HDM result is a deliberate choice to publish a finding that does not flatter the system. We treat its inclusion as part of the calibration's transparency commitment.


4. The Structural Finding: Model Family Rubric Interpretation

Section 3 establishes that the platform is performing as a review instrument — discriminating tiers, repeatable under stable prompts, robust under prompt evolution and adversarial pressure, and capable of independent error detection. This section addresses the second motivating question of the study: where does the multi-model architecture's value come from?

4.1 Setup

The Phase 4 dataset comprises 1,965 specialist scores covering 70 distinct entities (60 papers, 10 frameworks). Models, providers, specialist roles, and dimensions are as described in Section 2. Roles map to dimensions deterministically: math role scores mathematical_validity, internal_consistency, and completeness; sources role scores completeness, internal_consistency, and evidence_strength; science role scores clarity, novelty, and falsifiability.

The first question to address is whether observed differences in average scoring reflect what each model is asked to score (role assignment) or how each model interprets the rubric (rater behavior). The role-assignment distribution is uneven:

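| Model | Math | Sources | Science | Total |
| --- | --- | --- | --- | --- |
| claude-opus-4-20250514 | 220 | 0 | 213 | 433 |
| gpt-5.4-2026-03-05 | 214 | 127 | 306 | 647 |
| gpt-5.2-2025-12-11 | 208 | 3 | 0 | 211 |
| claude-opus-4-7 | 80 | 0 | 120 | 200 |
| grok-4-0709 | 72 | 106 | 0 | 178 |
| claude-sonnet-4-20250514 | 6 | 128 | 0 | 134 |
| gemini-2.5-pro | 42 | 57 | 0 | 99 |
| gemini-2.5-flash | 8 | 47 | 0 | 55 |
| gemini-3.1-flash-lite-preview | 8 | 0 | 0 | 8 |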

Several models are role-skewed: GPT-5.2 is 98.6% math-only; Claude Sonnet 4 is 96% sources; Grok-4 is split between math and sources but absent from science. The remainder of this section conducts the comparison within roles, then within dimensions, and finally within role-dimension pairs on identical entities.
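The skew percentages follow directly from the role-assignment counts. As a quick check, the sketch below recomputes the role shares and the dataset total from the table; the helper name `role_shares` is illustrative.

```python
ROLES = ("math", "sources", "science")

# Role-assignment counts from the table: (math, sources, science).
counts = {
    "claude-opus-4-20250514": (220, 0, 213),
    "gpt-5.4-2026-03-05": (214, 127, 306),
    "gpt-5.2-2025-12-11": (208, 3, 0),
    "claude-opus-4-7": (80, 0, 120),
    "grok-4-0709": (72, 106, 0),
    "claude-sonnet-4-20250514": (6, 128, 0),
    "gemini-2.5-pro": (42, 57, 0),
    "gemini-2.5-flash": (8, 47, 0),
    "gemini-3.1-flash-lite-preview": (8, 0, 0),
}


def role_shares(row):
    """Fraction of a model's scores that fall in each role."""
    total = sum(row)
    return {role: n / total for role, n in zip(ROLES, row)}


total_scores = sum(sum(row) for row in counts.values())
gpt52 = role_shares(counts["gpt-5.2-2025-12-11"])
sonnet = role_shares(counts["claude-sonnet-4-20250514"])

print(total_scores)                    # 1965
print(round(100 * gpt52["math"], 1))   # 98.6
print(round(100 * sonnet["sources"]))  # 96
```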

4.2 Per-Model Math Role Averages

Average scores within the math role only, ranked from most generous to strictest:

| Model | Math n | Avg Math Score |
| --- | --- | --- |
| gemini-2.5-pro | 42 | 4.07 |
| grok-4-0709 | 72 | 4.06 |
| claude-opus-4-20250514 | 220 | 3.90 |
| claude-opus-4-7 | 80 | 3.56 |
| gpt-5.2-2025-12-11 | 208 | 2.57 |
| gpt-5.4-2026-03-05 | 214 | 2.48 |

The 1.59-point spread between Gemini 2.5 Pro and GPT-5.4 — both performing the same role on overlapping entity sets — is large relative to the 1–5 scoring scale.

This ranking is descriptive, not normative. The question is not which model is "right." The question is whether the panel architecture can absorb their measurably different default calibrations and produce stable, interpretable final scores without either suppressing the disagreement or letting any single model family dominate the verdict.

4.3 Per-Dimension Within Math Role

Within the math role, scores broken down by individual dimension:

| Model | mathematical_validity | internal_consistency |
| --- | --- | --- |
| gemini-2.5-pro | 3.86 | 4.29 |
| grok-4-0709 | 3.83 | 4.28 |
| claude-opus-4-20250514 | 3.68 | 4.12 |
| claude-opus-4-7 | 3.28 | 3.85 |
| gpt-5.2-2025-12-11 | 2.47 | 2.67 |
| gpt-5.4-2026-03-05 | 2.30 | 2.65 |

The gap on mathematical_validity alone is 1.56 points (Gemini 2.5 Pro at 3.86 vs GPT-5.4 at 2.30). The gap on internal_consistency is 1.64 points (Gemini 2.5 Pro at 4.29 vs GPT-5.4 at 2.65).

[Figure 3 — Model × dimension heatmap] Heatmap of model × dimension mean scores within math role, color-coded by stringency. GPT family bottom row, Gemini/Grok top row. Screenshot from calibration analysis export.

4.4 Score Distributions in Math

Within the math role, score distributions are sharply different across model families:

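| Model | % scores = 2 | % scores = 5 | Lowest score given |
| --- | --- | --- | --- |
| gpt-5.4-2026-03-05 | 60% | 1% | 1 |
| gpt-5.2-2025-12-11 | 57% | 1% | 1 |
| claude-opus-4-20250514 | 6% | 30% | 1 |
| grok-4-0709 | 4% | 43% | 1 |
| gemini-2.5-pro | 0% | 45% | 2 (never gives 1) |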

GPT-5.4 effectively functions as a "reject-by-default" math grader: 60% of its scores are 2, and only 1% are 5. Gemini 2.5 Pro never assigned a score of 1 in its entire production-data sample (n = 99 specialist scores). These are not subtle differences in average — they are structurally different default behaviors in how each model treats the integer rating scale.
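A distribution profile of this kind can be derived from raw integer ratings with a few lines of code. The following is a minimal sketch under stated assumptions: the function name `score_profile` and both rating lists are hypothetical and are not drawn from the study's dataset.

```python
from collections import Counter


def score_profile(scores):
    """Share of 2s, share of 5s, and the lowest score in a list of 1-5 ratings."""
    counts = Counter(scores)
    total = len(scores)
    return {
        "pct_2": 100 * counts[2] / total,
        "pct_5": 100 * counts[5] / total,
        "lowest": min(scores),
    }


# Hypothetical ratings for illustration only; not the study's data.
strict_rater = [2, 2, 2, 3, 2, 1, 2, 3, 2, 4]
generous_rater = [5, 4, 5, 4, 5, 3, 4, 5, 2, 4]

strict_profile = score_profile(strict_rater)
generous_profile = score_profile(generous_rater)
print(strict_profile)    # {'pct_2': 60.0, 'pct_5': 0.0, 'lowest': 1}
print(generous_profile)  # {'pct_2': 10.0, 'pct_5': 40.0, 'lowest': 2}
```

Reporting the shape of the distribution alongside the mean is what distinguishes a rater whose default is 2 from one who is merely a little stricter on average.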

[Figure 4 — Score distribution histograms] Side-by-side score distribution histograms in math role, GPT family vs Gemini/Grok.

4.5 Paired Comparisons (Strongest Test)

The most rigorous test isolates a single rating task — same paper, same dimension, same role — and compares model behavior on identical inputs. For each entity reviewed by both a GPT model and a non-GPT model on the same dimension, we recorded both scores and computed the delta.

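| Comparison | Pairs | A higher | B higher | Tied | Avg Δ (A − B) |
|---|---|---|---|---|---|
| Claude vs GPT | 134 | 123 (92%) | 2 (1.5%) | 9 (6.7%) | +1.42 |
| Grok vs GPT | 44 | 40 (91%) | 3 (7%) | 1 (2%) | +1.69 |
| Gemini vs GPT | 40 | 35 (87.5%) | 2 (5%) | 3 (7.5%) | +1.41 |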

When any non-GPT model and any GPT model score the same math dimension on the same paper, the non-GPT model scores higher in approximately 90% of paired comparisons (198 of 218), by an average of 1.4 to 1.7 points. The asymmetry is not consistent with random rater noise: higher-direction movements outnumber lower-direction movements by roughly 28 to 1 across the three comparisons combined (198 vs 7), and by roughly 60 to 1 in the Claude vs GPT comparison alone (123 vs 2).
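One way to put a number on "not consistent with random rater noise" is an exact sign test on the counts in the table. This is a sketch rather than the study's own analysis: the choice of a sign test is ours, and pooling the three comparisons assumes the pairs are independent, which may not hold when the same papers and the same GPT scores appear in more than one comparison.

```python
from math import comb

# (A higher, B higher, tied) counts copied from the Section 4.5 table.
PAIRED_COUNTS = {
    "Claude vs GPT": (123, 2, 9),
    "Grok vs GPT": (40, 3, 1),
    "Gemini vs GPT": (35, 2, 3),
}


def sign_test_p(higher, lower):
    """Two-sided exact sign test; ties are dropped and the null is P(higher) = 0.5."""
    n = higher + lower
    k = max(higher, lower)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


total_higher = sum(higher for higher, _, _ in PAIRED_COUNTS.values())
total_lower = sum(lower for _, lower, _ in PAIRED_COUNTS.values())
total_pairs = sum(sum(counts) for counts in PAIRED_COUNTS.values())
pooled_p = sign_test_p(total_higher, total_lower)

print(total_higher, total_lower, total_pairs)  # 198 7 218
print(round(total_higher / total_pairs, 3))    # 0.908
print(pooled_p < 1e-6)                         # True
```

Under a null in which either direction is equally likely, a split this lopsided would be vanishingly rare, which is consistent with reading the gap as a systematic difference between raters rather than noise.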

[Figure 5 — Paired comparison win-rate chart] Three bars (Claude vs GPT, Grok vs GPT, Gemini vs GPT) showing % higher / % lower / % tied, with average delta annotated.

4.6 Cross-Role Validation

| Model | Math Avg | Sources Avg | Δ (Sources − Math) |
|---|---|---|---|
| gemini-2.5-pro | 4.07 | 4.53 | +0.46 |
| grok-4-0709 | 4.06 | 4.38 | +0.32 |
| gpt-5.4-2026-03-05 | 2.48 | 3.18 | +0.70 |

The sources role is uniformly easier to pass than the math role for all three models. But the relative ordering between models is preserved: Gemini > Grok > GPT in both roles. The rubric-interpretation gap is not an artifact of the math role specifically; it travels with the model.
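The order-preservation claim can be checked mechanically from the table. The snippet below is a small sketch using the reported role averages; `ranking` is a hypothetical helper name.

```python
# Role averages copied from the Section 4.6 table.
ROLE_AVGS = {
    "gemini-2.5-pro": {"math": 4.07, "sources": 4.53},
    "grok-4-0709": {"math": 4.06, "sources": 4.38},
    "gpt-5.4-2026-03-05": {"math": 2.48, "sources": 3.18},
}


def ranking(avgs, role):
    """Models ordered from most generous to strictest within one role."""
    return sorted(avgs, key=lambda model: avgs[model][role], reverse=True)


math_order = ranking(ROLE_AVGS, "math")
sources_order = ranking(ROLE_AVGS, "sources")
deltas = {m: round(v["sources"] - v["math"], 2) for m, v in ROLE_AVGS.items()}

print(math_order == sources_order)  # True
print(deltas)
```

With only three models, an identical ordering would arise by chance in one of six possible permutations, so this check is best read as directional support rather than a standalone test.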

4.7 Science Role: A Different Pattern

| Model | Science Avg |
|---|---|
| claude-opus-4-20250514 | 4.28 |
| gpt-5.4-2026-03-05 | 3.72 |
| claude-opus-4-7 | 3.68 |

The Claude Opus 4 / GPT-5.4 gap in science is 0.56 points, compared to a 1.42-point gap on math. GPT-5.4 scored highest of any model on novelty specifically (4.16, n=102), reversing its bottom-rank position on math dimensions.

The interpretation: GPT-family models read mathematical validity as "is every step explicitly derived and justified?" — a formal-verifier standard — while Anthropic, xAI, and Gemini-family models read the same rubric closer to "is the mathematical structure broadly correct and scientifically usable?" — a physicist standard. The two interpretations diverge less sharply on dimensions such as clarity and novelty, where the rubric definition leaves less interpretive room.

4.8 No Self-Model Bias

Before the production-data analysis, our working hypothesis was self-model bias: a model reviewing work authored by its own provider's models might score it more leniently. The Phase 4 data does not support this hypothesis. GPT-family models score every paper harshly on math regardless of authorship, including AI-authored papers from other model families and human-authored papers from prestigious venues. Gemini-family models score every paper generously regardless of authorship.

The actual finding is a structural rubric-interpretation difference between model families. We treat this as the appropriate honest update: the data refuted the original framing and supported a more interesting one.

4.9 Internal Anthropic Split

A finer-grained pattern emerged within a single provider's lineup. Claude Opus 4 and Claude Sonnet 4 — both Anthropic models — exhibit substantively different scoring behavior on rigor dimensions:

  • Claude Opus 4: relatively generous on internal consistency (4.12) and mathematical validity (3.68)
  • Claude Sonnet 4: substantially harsher on rigor when it appears in the math role (small sample, n = 6, but consistent direction)

This suggests that "model family" stringency is not always uniform within a provider's lineup. The size, training mix, or specialization of individual models within a family can produce real calibration differences, so calibration philosophies may not be interchangeable even within a single provider.

Sample sizes for Sonnet in math role are too small to support strong individual conclusions; we report the pattern as suggestive and flag it for follow-up monitoring.

4.10 Frameworks vs Papers

The dataset contains both papers (60 entities) and frameworks (10 entities). Frameworks score systematically lower on rigor dimensions:

| Dimension | Papers Avg | Frameworks Avg |
|---|---|---|
| Internal Consistency | 3.49 | 2.90 |
| Mathematical Validity | 3.09 | 2.62 |
| Novelty | (slightly lower) | (slightly higher) |

Frameworks are intended as ambitious unifying structures whose mathematical pinning down is expected to arrive in linked supporting papers; the system is correctly reading them as more original but less mathematically anchored than completed papers. We treat this pattern as evidence the system is sensitive to submission type, not as a defect.

4.11 Score Mobility vs Reasoning Mobility

Across 280 sequential dimension transitions in repeated reviews of the same entity, 148 transitions (53%) are flat. Among the remaining 132, movement concentrates strongly on rigor dimensions: internal consistency and mathematical validity are the most mobile, while novelty barely moves at all.

This is the same asymmetry observed in the v1.1 challenge calibration (Section 3.4): reasoning mobility is broader than score mobility. A system that moved every score on every challenge would be sycophantic; a system whose underlying reasoning never moved would be inert. The observed pattern is neither.
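The transition count can be illustrated with a short sketch. It assumes that a "sequential dimension transition" means the change in one dimension's score between two consecutive reviews of the same entity; the review history below is hypothetical and is not taken from the study's data.

```python
from collections import Counter


def transitions(history):
    """Yield (dimension, delta) for each consecutive pair of reviews of one entity."""
    for earlier, later in zip(history, history[1:]):
        for dimension, score in earlier.items():
            if dimension in later:
                yield dimension, later[dimension] - score


# Hypothetical review history for a single entity (illustrative, not study data).
history = [
    {"mathematical_validity": 3, "internal_consistency": 3, "novelty": 4},
    {"mathematical_validity": 2, "internal_consistency": 3, "novelty": 4},
    {"mathematical_validity": 3, "internal_consistency": 4, "novelty": 4},
]

moves = list(transitions(history))
flat_share = sum(1 for _, delta in moves if delta == 0) / len(moves)
moved_by_dimension = Counter(dim for dim, delta in moves if delta != 0)

print(flat_share)                # 0.5
print(dict(moved_by_dimension))  # {'mathematical_validity': 2, 'internal_consistency': 1}
```

Counting non-zero deltas per dimension is what reveals where movement concentrates; in this toy history novelty never moves, mirroring the direction of the reported pattern.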


5. Discussion

5.1 The Panel as the Measurement Instrument

The core implication of the Phase 4 finding is that the panel — not the individual model — is the unit of measurement. A multi-model panel is not redundancy. It is not ensemble averaging in the traditional machine-learning sense. It is a structured aggregation across raters with measurably different default calibrations, where the disagreement between raters is itself the signal that single-model review cannot produce.

Removing one model family does not merely remove one vote from the panel; it removes a distinct calibration philosophy. A panel that excludes GPT-family models loses the formal-verifier rigor signal — the source of the most aggressive flagging on derivation gaps. A panel that excludes Gemini-family models loses a counterweight to overly punitive scoring on otherwise structurally sound work. Either exclusion produces a different measurement instrument.

5.2 Why Median Anchoring Works

The platform's coordinator layer aggregates specialist scores using median or majority anchoring rather than arithmetic mean. The Phase 4 data validates this design choice empirically. Mean aggregation of a panel containing both a "60% twos" GPT specialist and a "45% fives" Gemini specialist would produce final scores that swing wildly with panel composition.

The purpose of median anchoring is not to pretend the disagreement does not exist. It is to keep any single model family from dominating the final verdict while preserving the disagreement as visible signal. The final score reflects the consensus; the underlying disagreement remains visible in Math Risk Flags, contested-dimension annotations, and individual specialist prose.
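The sensitivity argument can be made concrete with a leave-one-out comparison. This is a sketch under stated assumptions: the four scores and rater labels are hypothetical, and a plain median stands in for the coordinator's anchoring rule, whose exact implementation is not described here.

```python
from statistics import mean, median

# Hypothetical specialist scores for one paper on one dimension (illustrative only).
panel = {"gpt": 2, "claude": 4, "grok": 4, "gemini": 5}


def leave_one_out(scores, aggregate):
    """Aggregate the full panel, then each panel with one rater removed."""
    results = {"full": aggregate(list(scores.values()))}
    for dropped in scores:
        kept = [score for rater, score in scores.items() if rater != dropped]
        results["without_" + dropped] = aggregate(kept)
    return results


def swing(results):
    """Largest minus smallest aggregate across the panel variants."""
    return max(results.values()) - min(results.values())


mean_results = leave_one_out(panel, mean)
median_results = leave_one_out(panel, median)
print(round(swing(mean_results), 2))  # 1.0
print(swing(median_results))          # 0.0
```

In this example the mean moves by a full point depending on which rater is dropped, while the median stays at 4. The median is stable here because the strict and generous raters sit at the ends of the panel; a panel split evenly between the two extremes would not enjoy the same stability.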

5.3 Implications for AI Peer Review Broadly

If the rubric-interpretation gap documented here generalizes, then any single-model AI review system is implicitly committed to one specific rubric-interpretation profile. The consistency of the pattern across 218 paired comparisons in math (134 of them Claude vs GPT), with directional confirmation in sources and a different but visible pattern in science, suggests that it does.

A system built around GPT will grade the math dimension of otherwise polished papers systematically lower than the panel consensus. A system built around Gemini will grade it systematically higher. A system built around Claude will sit somewhere in the middle. None of these systems can surface the disagreement, because there is no other reviewer to disagree with.

We do not interpret this as evidence that any single model is "correct" or "biased" in some absolute sense. The formal-verifier interpretation of mathematical_validity is a coherent reading of the rubric — it is the reading a strict referee at a high-rigor venue might apply. The physicist interpretation is also coherent. The problem with single-model review is not that it picks the wrong reading; it is that it picks one reading and hides the choice.

5.4 Why We Don't Recommend Aggressive Normalization

A natural response to the Phase 4 finding might be: normalize the scores. Apply z-score corrections within model. Down-weight the strict raters.

We do not recommend this approach as a deployed mitigation, for two reasons.

First, the severe scores from GPT-family math specialists are often the scores most likely to surface real defects — derivation gaps, unverified central claims, definitional drift. The Quantum Potential pair and the 2HDM benchmark both show that GPT-family math severity tracks with substantive correctness on the failures we have ground truth for. Aggressive normalization would suppress the very signal the panel is designed to capture.

Second, the disagreement itself is informative. A paper that receives a consensus 4.0/5 with a strict-rater outlier at 2/5 is not the same kind of paper as one that receives 4.0/5 with all specialists agreeing at 4. Flattening the panel into a single normalized number erases this distinction.
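The distinction can be shown with two toy panels that share a consensus score but differ in spread. This is a minimal sketch: the panels are hypothetical, and max minus min is just one simple disagreement measure among several that could be used.

```python
from statistics import median


def summarize(scores):
    """Median score plus a simple disagreement measure (max minus min)."""
    return {"median": median(scores), "range": max(scores) - min(scores)}


# Hypothetical five-specialist panels (illustrative only).
unanimous = summarize([4, 4, 4, 4, 4])
with_outlier = summarize([2, 4, 4, 4, 4])
print(unanimous)     # {'median': 4, 'range': 0}
print(with_outlier)  # {'median': 4, 'range': 2}
```

A single normalized number would report both panels identically; keeping the spread next to the consensus score preserves the difference.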

Per-model calibration research — z-score reporting, panel-composition sensitivity analysis, weighted scoring as an audit channel — is part of the next-phase work. The distinction is between calibration as research (informative) and calibration as deployed mitigation (which would suppress the signal).

5.5 Limitations

Sample size variance across models. GPT-5.4 has 107 math_validity scores; Gemini 2.5 Pro has 21. The paired-comparison analysis controls for this by using overlap rather than total counts, but per-model averages are not equally precise across rows.

Dependence on panel composition. The reported scores reflect the specific roster of nine models in service during the calibration period. Adding, removing, or substituting models would change the resulting calibration. This is an explicit consequence of treating the panel as the measurement instrument, not a defect.

Production-data control. Different models reviewed different entity sets depending on panel composition and review timing. Section 4.6 provides a cross-role check showing that the relative model ordering is preserved across roles.

Phase 4 was not preregistered. Phase 4 emerged from analysis of production data after the baseline study completed. We report Phase 4 as exploratory but apply the same rigor standards as the preregistered phases. The Phase 4 dataset is available for inspection.

Formal-verification ceiling. As established by the 2HDM benchmark, prompt-level review reaches a boundary at which the system can identify the right neighborhood of an error but does not penalize it sufficiently to change the overall recommendation.

Scope of applicability. The calibration entity pool spans theoretical physics frameworks and papers; representativeness for other scientific domains has not been established.

Self-reporting risk. This is a calibration study published by the platform whose calibration is being studied. We have attempted to mitigate this by preregistering the v1 plan, publishing all data and decisions, and submitting this paper through the platform itself for blind review. Independent third-party calibration would strengthen the result further and is welcomed.


6. Next Phase

6.1 Lean Integration for Selected Math Validation

The 2HDM ceiling result motivates targeted formal-verification integration. The architectural plan is to introduce a verification subagent that operates on specific theorem-and-derivation flags raised by math specialists, rather than attempting whole-paper formalization. Detailed roadmap is published separately.

6.2 Per-Model Calibration Research

  • Z-score reporting as an optional overlay channel, separate from the canonical median-anchored final score
  • Panel-composition sensitivity analysis — what would the score be if a particular model were excluded?
  • Weighted scoring audit channel that produces the alternative numbers visibly but does not replace the canonical score
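As a rough illustration of the first item, a within-model z-score overlay could be computed as follows. This is a sketch, not the planned implementation: the rater names and scores are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_score_overlay(raw_by_model):
    """Standardize each model's scores against that model's own mean and spread."""
    overlay = {}
    for model, scores in raw_by_model.items():
        mu, sigma = mean(scores), pstdev(scores)
        # A rater with no spread carries no relative information; report zeros.
        overlay[model] = [round((s - mu) / sigma, 2) if sigma else 0.0 for s in scores]
    return overlay


# Hypothetical scores from two raters on the same four papers (illustrative only).
raw = {
    "strict_rater": [1, 2, 2, 3],
    "generous_rater": [3, 4, 4, 5],
}

overlay = z_score_overlay(raw)
print(overlay["strict_rater"])    # [-1.41, 0.0, 0.0, 1.41]
print(overlay["generous_rater"])  # [-1.41, 0.0, 0.0, 1.41]
```

The two raters differ by two full points on every paper, yet their z-scores are identical. That is why the overlay is useful for comparing relative rankings, and also why it is kept separate from the canonical score: standardizing removes exactly the level difference the panel is meant to expose.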

6.3 Continued Calibration Cadence

Each prompt version (v2.3, v2.4, …) will receive a calibration delta study analogous to Phase 3. Each new model added to the panel will receive a stringency-profile measurement before being trusted in production.

Specific items already on the next-phase list:

  • Repeatability across multiple papers and tiers (Phase 2 was a single-paper gate)
  • Non-math-heavy framework challenge testing
  • Length normalization audit
  • Per-specialist score tracking as a permanent display

6.4 Self-Submission

This paper has been submitted through the platform's own review pipeline. The full specialist reports, dimension scores, recommendation, and any subsequent dispute or revision history are publicly visible on the platform. We commit to publishing the system's assessment of this paper, including criticisms, regardless of outcome.


7. Conclusion

We report results from a four-phase calibration study of TheoryOfEverything.ai's multi-model AI peer review platform, preregistered except for the exploratory Phase 4. Across 70 distinct submissions and 1,965 specialist scores, the system demonstrated tier discrimination with monotonic ordering and large effect sizes, repeatability under stable prompts with zero variance on four of six dimensions, rank-preserving compression under prompt evolution, resistance to five distinct adversarial manipulation patterns, and independent identification of a real published-paper mathematical error confirmed by a separate published rebuttal. A known-flaw benchmark identified the boundary at which prompt-level review reaches its ceiling and motivated the next-phase formal-verification roadmap.

The most consequential finding is structural. AI model families read identical scoring rubrics in measurably different ways. On the same papers, on the same dimensions, in the same specialist role, model families differ by 1.4 to 1.7 points on average — an order of magnitude larger than the within-model repeatability spread documented under stable prompts. This is not a defect of the panel; it is the property the panel exists to surface. Single-model AI review systems do not have a mechanism to make this disagreement visible. A multi-model panel does.

The reliability of the final score depends on the architecture of the panel, not on the temperament of any individual reviewer. Calibrating the panel — preregistering its design, measuring its discrimination, monitoring its prompt evolution, and treating its rater heterogeneity as the signal it is — is the work documented here. We commit to continuing it.


References

Cicek, M., Ulu, S., Uslay, C., & Karniouchina, K. (2025). Unstable Intelligence: GenAI Struggles with Accuracy and Consistency. Rutgers Business Review. Retrieved from rbr.business.rutgers.edu.

D'Arcy, M., Hope, T., Birnbaum, L., & Downey, D. (2024). MARG: Multi-Agent Review Generation for Scientific Papers. arXiv preprint arXiv:2401.04259. https://arxiv.org/abs/2401.04259

Idahl, M., & Ahmadi, Z. (2024). OpenReviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews. arXiv preprint arXiv:2412.11948. https://arxiv.org/abs/2412.11948

Jiang, Y., & Ng, A. (2025). PaperReview.ai (Stanford Agentic Reviewer). Stanford ML Group. https://paperreview.ai/tech-overview

Jin, Y., et al. (2024). AgentReview: Exploring Peer Review Dynamics with LLM Agents. arXiv preprint arXiv:2406.12708. https://arxiv.org/abs/2406.12708

Vattay, G. (2026). Comment on Lohmiller & Slotine: omitted quantum potential term. arXiv preprint arXiv:2605.02621.

Acharya et al. (2024). Quantum error correction below the surface code threshold. Nature. arXiv:2408.13687.

Susskind, L. (2026). Is Time Reversal in de Sitter Space a Spontaneously Broken Gauge Symmetry? arXiv preprint arXiv:2603.12434.


Appendix A. Calibration Version History

| Version | Date | Status | Summary |
|---|---|---|---|
| v0.1-draft | 2026-03-16 | Superseded | Initial scaffolding and preregistration publication. |
| v1.0 | 2026-04-11 | Superseded | Primary calibration. 22 human + 5 AI papers. Tier discrimination passed. |
| v1.1-challenge | 2026-05-01 | Superseded | Adversarial calibration. 5 manipulation vectors held; 2 legitimate challenges showed reasoning mobility. |
| v1.2-2hdm | 2026-05-04 | Superseded | 2HDM known-flaw benchmark. System flagged exact equations every run; coordinator softened the signal. Motivated formal-verification roadmap. |
| v1.3 | 2026-05-08 | Active | This paper. Phases 1–4 complete. v2.2 prompts and Math Risk Flags active in production. |

Appendix B. Decision Log (Selected)

| Date | Decision | Rationale |
|---|---|---|
| 2026-03-16 | Publish v1 preregistration before running baseline study | Standard transparency commitment |
| 2026-04-11 | Identify score-translation bias from synthetic paper testing | Specialist prose detected planted defect; numeric scores did not penalize proportionally; motivated v2.x prompt anchoring |
| 2026-05-01 | Implement red-flag caps and anti-gaming rules | Anchor the rubric to specific structural defects |
| 2026-05-04 | Accept 2HDM ceiling as a current limitation | Prompt-level review reaches a boundary; Lean is the path forward |
| 2026-05-07 | Math Risk Flags as near-term improvement path | Surfaces specialist concerns without forcing them through score aggregation |
| 2026-05-07 | Quantum Potential pair = strongest single validation datapoint | Independent identification of real published-paper error |
| 2026-05-08 | Phase 4 complete: no self-model bias; structural rubric-interpretation gap documented | Updated framing based on production data |
| 2026-05-08 | Submit this paper through the platform itself | Apply transparency standard to the meta-artifact |

Appendix C. Paper Pool (Phase 1 Baseline)

Available at theoryofeverything.ai/calibration. Includes Tier 1 gold-standard published papers, Tier 2 mid-range work, Tier 3 framework papers, Tier 4 synthetic weak papers with planted failure modes, and AI-generated blind submissions across five authoring models.

Appendix D. Production Data Export

The Phase 4 dataset (1,965 specialist scores across 70 entities, 9 models, 4 providers, 3 specialist roles, 7 dimensions) is available for inspection on request.


Submitted through TheoryOfEverything.ai under Calibration v1.3, prompt v2.2, with Math Risk Flags active.

Figure placeholders in square brackets mark screenshots that are being captured from the live platform and will be inserted in a forthcoming update. The data tables underlying each figure are present in full above.
