Complete · Calibration v1.3 · May 8, 2026

Calibration Study — TheoryOfEverything.ai Multi-Model AI Peer Review Panel

Four-phase preregistered study · 70 submissions · 1,965 specialist scores · 9 models · 4 providers

TheoryOfEverything.ai's multi-model peer review panel has completed a four-phase preregistered calibration study covering 70 distinct submissions and 1,965 specialist scores across nine models from four AI providers. The study confirms that the platform produces monotonically ordered tier scores with large effect sizes (Spearman ρ = −0.79, Cohen's d = 4.61), scores repeatably under stable prompts (zero variance on four of six dimensions across four blind re-runs), preserves rank order under prompt evolution, and resists five distinct adversarial manipulation patterns. In a test on a published Royal Society paper and its arXiv rebuttal, the panel also independently flagged the same omitted derivation term that prompted the published rebuttal.
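For readers unfamiliar with the two statistics quoted above, the following is a minimal sketch of how a Spearman rank correlation and a Cohen's d (pooled standard deviation) can be computed. It is not the study's analysis code. The function names, the tier coding (1 = strongest tier, so a negative ρ means higher tiers score higher), and the sample numbers are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical data, NOT study results: tier label per paper and panel score.
tiers = [1, 1, 2, 2, 3, 3]
scores = [4.6, 4.4, 3.9, 3.7, 2.8, 2.5]

rho = spearman_rho(tiers, scores)      # negative when tier 1 scores highest
d = cohens_d(scores[:2], scores[4:])   # top tier versus bottom tier
print(f"rho={rho:.2f}, d={d:.2f}")
```

With this coding, a strongly negative ρ indicates monotonic tier ordering, and d expresses the gap between two tiers in units of their pooled standard deviation.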

The most consequential finding is structural. AI model families interpret identical scoring rubrics in measurably different ways: on the same papers, in the same specialist role, GPT-family models scored mathematical rigor 1.4 to 1.7 points lower than Anthropic, xAI, and Gemini-family models across 134+ paired comparisons. The pattern inverted on novelty. This is the property a multi-model panel exists to surface — a single-model AI review system inherits one specific rubric-interpretation profile with no mechanism to expose the disagreement, while a calibrated panel makes it visible and aggregates it through structured consensus.
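One way to picture a paired comparison of this kind is to take, for each paper scored by both model families in the same role, the difference in score on a given dimension and then average those differences. The sketch below assumes a hypothetical data layout (paper → family → dimension → score) and hypothetical family labels; the numbers are invented for illustration and are not taken from the study.

```python
from statistics import mean


def paired_gap(scores, family_a, family_b, dimension):
    """Mean of (family_a - family_b) over papers that both families scored.

    Returns (mean_difference, number_of_pairs).
    """
    diffs = []
    for by_family in scores.values():
        if family_a in by_family and family_b in by_family:
            diffs.append(
                by_family[family_a][dimension] - by_family[family_b][dimension]
            )
    if not diffs:
        raise ValueError("no paper was scored by both families")
    return mean(diffs), len(diffs)


# Hypothetical scores, NOT study data.
example_scores = {
    "paper-1": {
        "family_a": {"rigor": 2.5, "novelty": 4.0},
        "family_b": {"rigor": 4.0, "novelty": 3.0},
    },
    "paper-2": {
        "family_a": {"rigor": 3.0, "novelty": 3.5},
        "family_b": {"rigor": 4.5, "novelty": 3.0},
    },
    # Scored by only one family, so it contributes no pair.
    "paper-3": {
        "family_a": {"rigor": 3.5, "novelty": 3.0},
    },
}

rigor_gap, rigor_pairs = paired_gap(example_scores, "family_a", "family_b", "rigor")
novelty_gap, novelty_pairs = paired_gap(example_scores, "family_a", "family_b", "novelty")
print(rigor_gap, rigor_pairs, novelty_gap, novelty_pairs)
```

In this toy data the gap is negative on rigor and positive on novelty, which is the shape of inversion the paragraph above describes. Pairing by paper keeps differences in paper quality from being mistaken for differences between model families.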

Preregistration · v1 · March 16, 2026

What we committed to test — before running it →

Primary hypothesis, methodology, success criteria, and analysis plan. Published before any formal runs began.

White Paper · v1.3 · May 8, 2026

Full results, methodology, and findings →

All four phases, per-model analysis, adversarial results, limitations, and next-phase roadmap.

Live Review · Coming Soon

The white paper reviewed through our own pipeline

This paper has been submitted through the platform. The specialist reports and recommendation will be publicly visible here once processing completes — applying the same transparency standard to the calibration paper that we apply to all submitted work.

Version history (prior releases, superseded)
v0.1-draft
2026-03-16 · superseded

Initial calibration scaffolding and preregistration publication.

v1.0
2026-04-11 · superseded

Primary calibration report. 22 human papers + 5 AI papers. Tier discrimination passes (monotonic ordering confirmed). Score evolution documented. Merit-over-pedigree finding. AI model discrimination (1.86-point spread).

v1.1-challenge
2026-05-01 · superseded

Challenge system adversarial calibration. 5 manipulation vectors (math errors, fabricated citations, dimension deflection, circular reasoning, social engineering) + 2 legitimate upward challenges. All held. System demonstrates adversarial resistance + reasoning mobility.

v1.2-2hdm
2026-05-04 · superseded

2HDM known-flaw benchmark. The paper has a confirmed error in Theorem 1 (Eq. 4.39), found by Lean formalization. All 3 runs approved the paper (4.0/5 avg). The math specialist flagged the exact equations in every run, but the coordinator softened the signal. Motivates Track 1 prompt fixes.