Calibration Study — TheoryOfEverything.ai Multi-Model AI Peer Review Panel
Four-phase preregistered study · 70 submissions · 1,965 specialist scores · 9 models · 4 providers
TheoryOfEverything.ai's multi-model peer review panel has completed a four-phase preregistered calibration study covering 70 distinct submissions and 1,965 specialist scores across nine models from four AI providers. The study confirms that the platform produces monotonically ordered tier scores with large effect sizes (Spearman ρ = −0.79, Cohen's d = 4.61), scores repeatably under stable prompts (zero variance on four of six dimensions across four blind re-runs), preserves rank order under prompt evolution, and resists five distinct adversarial manipulation patterns. On a published Royal Society paper and its arXiv rebuttal, the panel also independently flagged the same omitted derivation term that prompted the published rebuttal.
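For readers unfamiliar with the two effect-size statistics quoted above, the following is a minimal sketch of how a Spearman rank correlation and a Cohen's d could be computed from tier labels and panel scores. It is not the study's analysis code. The tier labels, the scores, and the assumption that tier 1 is the strongest tier (which would make ρ negative) are all hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical data: tier 1 is assumed to be the strongest tier.
tiers = [1, 1, 1, 2, 2, 2, 3, 3, 3]
scores = [4.6, 4.4, 4.5, 3.6, 3.9, 3.4, 2.5, 2.8, 2.2]

rho = spearman_rho(tiers, scores)       # negative when higher tiers score lower
d = cohens_d(scores[:3], scores[6:])    # top tier versus bottom tier
print(f"rho = {rho:.2f}, d = {d:.2f}")
```

Under this assumed tier convention, a strongly negative ρ indicates that scores fall as the tier number rises, and a large positive d indicates that the top and bottom tiers are separated by several pooled standard deviations.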
The most consequential finding is structural. AI model families interpret identical scoring rubrics in measurably different ways: on the same papers, in the same specialist role, GPT-family models scored mathematical rigor 1.4 to 1.7 points lower than Anthropic, xAI, and Gemini-family models across 134+ paired comparisons. The pattern inverted on novelty. This is the property a multi-model panel exists to surface. A single-model AI review system inherits one specific rubric-interpretation profile with no mechanism to expose the disagreement, while a calibrated panel makes the disagreement visible and aggregates it through structured consensus.
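A paired comparison of this kind can be sketched as follows: for each paper, take the score from a reference model family and subtract the score another family gave the same paper on the same dimension, then average those gaps per family. This is an illustrative sketch only. The record layout, the family and dimension names, the `paired_gaps` helper, and the scores are hypothetical and are not taken from the study's data or code.

```python
from collections import defaultdict
from statistics import mean


def paired_gaps(records, dimension, reference_family):
    """Mean per-paper score gap (reference minus other family) on one dimension.

    records: iterable of (paper_id, family, dimension, score) tuples.
    Returns {family: (mean_gap, number_of_paired_papers)}.
    """
    # Index the scores for the chosen dimension by paper, then by family.
    by_paper = defaultdict(dict)
    for paper_id, family, dim, score in records:
        if dim == dimension:
            by_paper[paper_id][family] = score

    # Only papers scored by the reference family contribute a pair.
    gaps = defaultdict(list)
    for family_scores in by_paper.values():
        if reference_family not in family_scores:
            continue
        ref_score = family_scores[reference_family]
        for family, score in family_scores.items():
            if family != reference_family:
                gaps[family].append(ref_score - score)

    return {family: (mean(g), len(g)) for family, g in gaps.items()}


# Hypothetical records for illustration only.
records = [
    ("paper-01", "gpt", "rigor", 3.0),
    ("paper-01", "anthropic", "rigor", 4.5),
    ("paper-01", "gemini", "rigor", 4.5),
    ("paper-02", "gpt", "rigor", 2.5),
    ("paper-02", "anthropic", "rigor", 4.0),
    ("paper-02", "gemini", "rigor", 4.5),
    ("paper-01", "gpt", "novelty", 4.0),
    ("paper-01", "anthropic", "novelty", 3.0),
]

gaps = paired_gaps(records, "rigor", "gpt")
for family, (gap, n) in sorted(gaps.items()):
    print(f"{family}: mean gap {gap:+.2f} over {n} paired papers")
```

Pairing within each paper removes differences in paper quality from the comparison, so a consistently negative mean gap reflects how the reference family reads the rubric and not which papers it happened to score.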
Preregistration · v1 · March 16, 2026
What we committed to test — before running it →
Primary hypothesis, methodology, success criteria, and analysis plan. Published before any formal runs began.
White Paper · v1.3 · May 8, 2026
Full results, methodology, and findings →
All four phases, per-model analysis, adversarial results, limitations, and next-phase roadmap.
Live Review · Coming Soon
The white paper reviewed through our own pipeline
This paper has been submitted through the platform. The specialist reports and recommendation will be publicly visible here once processing completes, so that the calibration paper is held to the same transparency standard we apply to all submitted work.
Version history (prior releases, superseded)
Initial calibration scaffolding and preregistration publication.
Primary calibration report. 22 human papers + 5 AI papers. Tier discrimination passes (monotonic ordering confirmed). Score evolution documented. Merit-over-pedigree finding. AI model discrimination (1.86-point spread).
Challenge system adversarial calibration. 5 manipulation vectors (math errors, fabricated citations, dimension deflection, circular reasoning, social engineering) + 2 legitimate upward challenges. All held. System demonstrates adversarial resistance + reasoning mobility.
2HDM known-flaw benchmark. Paper has a confirmed error in Theorem 1 (Eq. 4.39), found by Lean formalization. All 3 runs approved the paper (4.0/5 avg). Math specialist flagged the exact equations in every run, but the coordinator softened the signal. Motivates Track 1 prompt fixes.