Published March 16, 2026. Locked before any formal runs began.

TOEShare Calibration Study — Preregistration v1

Purpose

Demonstrate that TOEShare's multi-agent AI review system produces meaningful, discriminating assessments across a spectrum of scientific quality.

Primary Hypothesis

Tier median overall scores will be monotonically decreasing: Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic)

Secondary Hypothesis (Exploratory)

AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.

Methodology

Review Protocol

All papers submitted metadata-blind (no venue, credential, or publication status information provided to the review system)
Full multi-agent pipeline: 3 specialist roles (Math/Logic, Sources/Evidence, Science/Novelty) across multiple models per role, plus coordinator synthesis
Consensus rounds triggered when specialists disagree
Current production configuration used for all runs

Configuration Snapshot (locked per run)

Each run records:

Git commit hash
Prompt versions for all specialists and coordinator
Model roster (which models in which specialist slots)
Timestamp
Run treated as immutable for auditability
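
One way such a per-run snapshot could be represented is sketched below. This is a minimal illustration, not the TOEShare implementation: the class name RunSnapshot, its field layout, and the fingerprint helper are all assumptions introduced here. A frozen dataclass is used so that any attempt to modify a recorded run raises an error, which matches the "immutable" requirement.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical immutable record of the configuration behind one run."""
    git_commit: str          # commit hash of the code that produced the run
    prompt_versions: tuple   # (role, version) pairs, specialists + coordinator
    model_roster: tuple      # (role, (model, ...)) pairs
    timestamp: str           # ISO 8601 string recorded when the run started

    def fingerprint(self) -> str:
        """Return a SHA-256 digest of the snapshot for use in audit logs."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two snapshots with identical fields produce the same fingerprint, so a re-run under an unchanged configuration can be recognized by comparing digests.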

Terminology

"Metadata-blind": the system receives paper content but no external quality signals (venue, author credentials, citation counts, publication status)
NOT fully blind: the system can infer institutional context from content

Failure Handling

1 automatic retry per failed specialist agent
60-second timeout per agent call
If specialist fails after retry, run marked as "partial" with failure documented
Partial runs are valid data — they demonstrate system resilience
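
The retry and timeout rules above can be sketched as follows. This is an illustration under stated assumptions: the function names, the return shapes, and the "complete" status label are invented here (only "partial" comes from the protocol), and a specialist is modeled as a plain callable that takes the paper text.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

AGENT_TIMEOUT_S = 60  # per-call timeout stated in the protocol
MAX_RETRIES = 1       # one automatic retry stated in the protocol


def call_with_retry(agent_fn, paper, timeout_s=AGENT_TIMEOUT_S, retries=MAX_RETRIES):
    """Call one specialist; return (ok, result, errors) after at most 1 + retries attempts."""
    errors = []
    for attempt in range(1 + retries):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(agent_fn, paper)
        try:
            return True, future.result(timeout=timeout_s), errors
        except FutureTimeout:
            errors.append(f"attempt {attempt + 1}: timed out after {timeout_s}s")
        except Exception as exc:
            errors.append(f"attempt {attempt + 1}: {exc!r}")
        finally:
            # Do not block on a hung call; the worker thread is abandoned, not killed.
            pool.shutdown(wait=False)
    return False, None, errors


def run_specialists(specialists, paper, **kwargs):
    """Run every specialist; the run is 'partial' if any one fails after its retry."""
    results, failures = {}, {}
    for role, agent_fn in specialists.items():
        ok, result, errors = call_with_retry(agent_fn, paper, **kwargs)
        if ok:
            results[role] = result
        else:
            failures[role] = errors  # failure documented alongside the run
    status = "partial" if failures else "complete"
    return {"status": status, "results": results, "failures": failures}
```

Keeping the error messages from both attempts in the run record is one way to satisfy the "failure documented" requirement.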

Repeatability

Minimum 1 re-run per tier (4 total)
Report score spread, not just point scores
Top scorer per tier selected for re-run
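
A minimal sketch of how the re-run candidates and the score spread might be computed is given below. The function names and the input layout (paper id mapped to a tier and a score) are assumptions made for illustration, and ties are broken by first occurrence, which the protocol does not specify.

```python
def top_scorer_per_tier(first_run):
    """Map each tier to its highest-scoring paper.

    first_run maps paper id -> (tier, overall score).
    On a tie, the paper encountered first is kept (an assumption).
    """
    best = {}
    for paper, (tier, score) in first_run.items():
        if tier not in best or score > first_run[best[tier]][1]:
            best[tier] = paper
    return best


def score_spread(scores):
    """Summarize repeated runs of one paper as minimum, maximum, and range."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to report a spread")
    lowest, highest = min(scores), max(scores)
    return {"min": lowest, "max": highest, "range": highest - lowest}
```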

Analysis Method

Tier medians + interquartile range (IQR)
Spearman rank correlation between tier assignment and overall score
Kruskal-Wallis test if sample supports it
Effect sizes reported alongside averages
Two claims kept separate:
1. Tier discrimination (primary, must be supported by data)
2. AI-vs-human comparison (exploratory, labeled as such)
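
The first two analysis steps can be sketched with the Python standard library alone, as below. This is an illustration rather than the study's analysis code: the function names are hypothetical, the IQR uses the "inclusive" quantile method so that quartiles stay within the observed range for very small tiers, and the Spearman coefficient is computed as the Pearson correlation of average ranks. The Kruskal-Wallis test and significance testing are not covered here.

```python
from math import sqrt
from statistics import mean, median, quantiles


def tier_summary(scores_by_tier):
    """Return the median and IQR per tier; each tier needs at least two scores."""
    summary = {}
    for tier, scores in scores_by_tier.items():
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        summary[tier] = {"median": median(scores), "iqr": q3 - q1}
    return summary


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because a higher tier number denotes lower expected quality, support for the primary hypothesis would appear as a negative correlation between tier number and overall score.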

Paper Pool

Pool A: Human-Written (12 papers)

Tier 1: Gold Standard (4 papers)

Published in top-tier peer-reviewed venues. Expected score range: 3.8-4.8/5

Paper 1: Google Quantum AI — Dynamic Surface Codes (Nature Physics, Oct 2025)
Paper 2: NOvA + T2K Joint Neutrino Oscillation Analysis (Nature, Oct 2025)
Paper 3: Emergent Photons in Quantum Spin Ice (Nature Physics, 2025)
Paper 4: Superconducting Qubit Material Improvements (Nature, Nov 2025)

Tier 2: Mid-Range (2 papers)

Competent work with notable limitations. Expected score range: 2.5-3.6/5

Paper 5: arXiv XOR framework paper (preprint, known math error)
Paper 6: Independent researcher paper (TBD — sourcing in progress)

Tier 3: Framework Papers (3 papers)

Ambitious unified theories from independent researchers. Expected score range: 2.0-3.8/5

Paper 7: GETT Foundation Stone — John Holland (pending permission)
Paper 8: GETT Hypothesis 3 — John Holland (pending permission)
Paper 9: QH Ghost Rank paper — Adam Murphy

Tier 4: Synthetic Weak Papers (3 papers)

Original papers written specifically for calibration, designed to contain specific failure modes. Expected score range: 1.0-2.5/5

Paper 10: Internal consistency failure (definitional drift)
Paper 11: Unfalsifiable framework (vague claims, no testable predictions)
Paper 12: Circular reasoning (dimensional analysis masquerading as derivation)

Pool B: AI-Generated Blind Papers (4 papers)

Each written by a different AI model, submitted without any indication of AI authorship. Expected score range: 2.5-4.0/5

Paper A1: Written by Claude — Hubble tension resolution
Paper A2: Written by ChatGPT — Black hole area theorem
Paper A3: Written by Gemini — Neutrino mass / dark energy connection
Paper A4: Written by Grok — Double-slit scalar field interpretation

AI Recusal Protocol

Default: authoring model remains in the review panel (tests architecture robustness). One paper also run with authoring model removed for bias comparison.
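
The recusal rule could be expressed as a small roster filter, sketched below. The function name, the roster layout (role mapped to a list of model names), and the parameter names are assumptions for illustration only; the actual roster is whatever the configuration snapshot records.

```python
def review_panel(roster, authoring_model=None, recuse=False):
    """Return a copy of the roster, optionally without the authoring model.

    roster maps specialist role -> list of model names (hypothetical layout).
    With recuse=False (the default protocol) the panel is returned unchanged.
    """
    if not recuse or authoring_model is None:
        return {role: list(models) for role, models in roster.items()}
    return {
        role: [m for m in models if m != authoring_model]
        for role, models in roster.items()
    }
```

Running the same paper once with recuse=False and once with recuse=True gives the pair of runs needed for the bias comparison.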

Repeatability Runs

4 re-runs: top scorer from each tier
1 bias comparison: same AI paper run with and without authoring model
Total: 12 human-written + 4 AI-generated + 4 repeatability + 1 bias = 21 runs

Success Criteria

PRIMARY: Tier medians are monotonically ordered (T1 > T2 > T3 > T4)
SECONDARY: Spearman correlation between tier and score is significant
EXPLORATORY: AI papers do not cluster at extremes relative to their quality tier
TRANSPARENCY: Any result that fails these criteria is reported honestly with analysis
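
The primary criterion reduces to a strict ordering check on the four tier medians. A minimal sketch follows, assuming the medians are held in a mapping keyed by tier number; the function name is hypothetical.

```python
def medians_strictly_decreasing(medians_by_tier):
    """Return True if the median drops at every step from Tier 1 to the last tier.

    medians_by_tier maps tier number -> median overall score.
    Equal medians in adjacent tiers count as a failure, matching T1 > T2 > T3 > T4.
    """
    ordered = [medians_by_tier[tier] for tier in sorted(medians_by_tier)]
    return all(a > b for a, b in zip(ordered, ordered[1:]))
```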

Timeline

Week 1: Paper sourcing, synthetic paper creation, AI paper generation
Week 2: All primary runs + repeatability runs
Week 3: Analysis, report writing, publication

Cost Estimate

~25 runs × $3/run = ~$75 total