TOEShare Calibration Study — Preregistration v1
Purpose
Demonstrate that TOEShare's multi-agent AI review system produces meaningful, discriminating assessments across a spectrum of scientific quality.
Primary Hypothesis
Tier median overall scores will be monotonically decreasing: Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic)
Secondary Hypothesis (Exploratory)
AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.
Methodology
Review Protocol
- All papers submitted metadata-blind (no venue, credential, or publication status information provided to the review system)
- Full multi-agent pipeline: 3 specialist roles (Math/Logic, Sources/Evidence, Science/Novelty) across multiple models per role, plus coordinator synthesis
- Consensus rounds triggered when specialists disagree
- Current production configuration used for all runs
Configuration Snapshot (locked per run)
Each run records:
- Git commit hash
- Prompt versions for all specialists and coordinator
- Model roster (which models in which specialist slots)
- Timestamp
- Once recorded, the run is treated as immutable for auditability
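A minimal sketch of how such a snapshot could be captured, assuming a Python harness. The class name RunSnapshot, its field layout, and the fingerprint helper are hypothetical illustrations, not part of the TOEShare codebase; the example values are placeholders.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RunSnapshot:
    git_commit: str         # commit hash of the pipeline code
    prompt_versions: tuple  # (role, version) pairs; hypothetical format
    model_roster: tuple     # (role, model name) pairs; hypothetical format
    timestamp: str          # ISO 8601 timestamp in UTC

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the snapshot, usable as an audit key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


snapshot = RunSnapshot(
    git_commit="0000000",  # placeholder, not a real commit
    prompt_versions=(("math_logic", "v1"), ("coordinator", "v1")),
    model_roster=(("math_logic", "model-a"), ("math_logic", "model-b")),
    timestamp=datetime.now(timezone.utc).isoformat(),
)
```

Freezing the dataclass only prevents reassignment in memory; storing the fingerprint next to the run output is one way to detect later edits to the saved record.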
Terminology
- "Metadata-blind": the system receives paper content but no external quality signals (venue, author credentials, citation counts, publication status)
- NOT fully blind: the system can infer institutional context from content
Failure Handling
- 1 automatic retry per failed specialist agent
- 60-second timeout per agent call
- If specialist fails after retry, run marked as "partial" with failure documented
- Partial runs are valid data — they demonstrate system resilience
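The failure-handling rules above can be sketched as follows, assuming each specialist is exposed as an async callable. The function names, the dict-of-agents interface, and the result layout are assumptions made for illustration; only the 60-second timeout, the single retry, and the "partial" label come from the protocol.

```python
import asyncio

AGENT_TIMEOUT_S = 60  # per-call timeout from the protocol
MAX_RETRIES = 1       # one automatic retry per failed specialist


async def call_with_retry(agent, paper, timeout_s=AGENT_TIMEOUT_S):
    """Return (result, errors); result is None if every attempt failed."""
    errors = []
    for attempt in range(1 + MAX_RETRIES):
        try:
            result = await asyncio.wait_for(agent(paper), timeout=timeout_s)
            return result, errors
        except Exception as exc:  # includes timeouts raised by wait_for
            errors.append(f"attempt {attempt + 1}: {type(exc).__name__}: {exc}")
    return None, errors


async def run_specialists(agents, paper, timeout_s=AGENT_TIMEOUT_S):
    """agents: dict mapping a role name to an async callable (hypothetical)."""
    names = list(agents)
    outcomes = await asyncio.gather(
        *(call_with_retry(agents[name], paper, timeout_s) for name in names)
    )
    results, failures = {}, {}
    for name, (result, errors) in zip(names, outcomes):
        if result is None:
            failures[name] = errors  # documented failure for the run record
        else:
            results[name] = result
    status = "partial" if failures else "complete"
    return {"status": status, "results": results, "failures": failures}
```

This sketch treats a None result as a failure, so it assumes a successful specialist always returns a non-None review object.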
Repeatability
- Minimum 1 re-run per tier (4 total)
- Report score spread, not just point scores
- Top scorer per tier selected for re-run
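As an illustration of the repeatability step, the sketch below picks the top scorer in each tier and summarizes the spread across repeated runs of one paper. The function names and input layout are hypothetical, and using min, max, and range as the spread measure is an assumption, since the protocol does not name a specific statistic.

```python
def top_scorer_per_tier(scores_by_paper, tier_of):
    """Pick the highest-scoring paper in each tier for a re-run."""
    best = {}
    for paper, score in scores_by_paper.items():
        tier = tier_of[paper]
        if tier not in best or score > scores_by_paper[best[tier]]:
            best[tier] = paper
    return best


def score_spread(run_scores):
    """Return min, max, and range for repeated runs of one paper."""
    lo, hi = min(run_scores), max(run_scores)
    return {"min": lo, "max": hi, "range": round(hi - lo, 3)}
```

With only one re-run per tier, each spread is based on two scores, so it indicates run-to-run variation rather than estimating a variance.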
Analysis Method
- Tier medians + interquartile range (IQR)
- Spearman rank correlation between tier assignment and overall score
- Kruskal-Wallis test if sample supports it
- Effect sizes reported alongside medians
- Two claims kept separate:
  - Tier discrimination (primary, must be supported by data)
  - AI-vs-human comparison (exploratory, labeled as such)
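A sketch of the descriptive part of this analysis using only the Python standard library: tier medians with IQR, and a Spearman correlation computed as the Pearson correlation of tie-averaged ranks. The function names are hypothetical and the scores are made-up illustrative values, not study data. Significance testing and the Kruskal-Wallis test are omitted here; a statistics package would normally supply them.

```python
from statistics import median, quantiles


def tier_summary(scores_by_tier):
    """Median and interquartile range per tier (needs >= 2 scores per tier)."""
    summary = {}
    for tier, scores in scores_by_tier.items():
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        summary[tier] = {"median": median(scores), "iqr": q3 - q1}
    return summary


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative scores only, not study data.
scores_by_tier = {
    1: [4.5, 4.2, 4.4, 4.0],
    2: [3.1, 2.9],
    3: [2.8, 2.4, 3.0],
    4: [1.5, 2.0, 1.2],
}
tiers = [tier for tier, scores in scores_by_tier.items() for _ in scores]
scores = [score for tier_scores in scores_by_tier.values() for score in tier_scores]
summary = tier_summary(scores_by_tier)
rho = spearman_rho(tiers, scores)  # negative when higher tiers score lower
```

Because every paper in a tier shares the same tier number, the tier variable is heavily tied, which is why the rank function averages tied positions instead of assuming distinct values.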
Paper Pool
Pool A: Human-Written (12 papers)
Tier 1: Gold Standard (4 papers)
Published in top-tier peer-reviewed venues. Expected score range: 3.8-4.8/5
- Paper 1: Google Quantum AI — Dynamic Surface Codes (Nature Physics, Oct 2025)
- Paper 2: NOvA + T2K Joint Neutrino Oscillation Analysis (Nature, Oct 2025)
- Paper 3: Emergent Photons in Quantum Spin Ice (Nature Physics, 2025)
- Paper 4: Superconducting Qubit Material Improvements (Nature, Nov 2025)
Tier 2: Mid-Range (2 papers)
Competent work with notable limitations. Expected score range: 2.5-3.6/5
- Paper 5: arXiv XOR framework paper (preprint, known math error)
- Paper 6: Independent researcher paper (TBD — sourcing in progress)
Tier 3: Framework Papers (3 papers)
Ambitious unified theories from independent researchers. Expected score range: 2.0-3.8/5
- Paper 7: GETT Foundation Stone — John Holland (pending permission)
- Paper 8: GETT Hypothesis 3 — John Holland (pending permission)
- Paper 9: QH Ghost Rank paper — Adam Murphy
Tier 4: Synthetic Weak Papers (3 papers)
Original papers written specifically for calibration, each designed to contain a specific failure mode. Expected score range: 1.0-2.5/5
- Paper 10: Internal consistency failure (definitional drift)
- Paper 11: Unfalsifiable framework (vague claims, no testable predictions)
- Paper 12: Circular reasoning (dimensional analysis masquerading as derivation)
Pool B: AI-Generated Blind Papers (4 papers)
Each written by a different AI model, submitted without any indication of AI authorship. Expected score range: 2.5-4.0/5
- Paper A1: Written by Claude — Hubble tension resolution
- Paper A2: Written by ChatGPT — Black hole area theorem
- Paper A3: Written by Gemini — Neutrino mass / dark energy connection
- Paper A4: Written by Grok — Double-slit scalar field interpretation
AI Recusal Protocol
Default: the authoring model remains in the review panel, which tests the robustness of the architecture. One AI paper is also run a second time with the authoring model removed, for bias comparison.
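The recused panel for the bias comparison could be derived from the default roster as sketched below. The roster layout (role name mapped to a list of model names) and the function name are assumptions for illustration; the actual roster format is not specified here.

```python
def panel_without_author(model_roster, authoring_model):
    """Return a copy of the roster with the authoring model removed from every role."""
    return {
        role: [model for model in models if model != authoring_model]
        for role, models in model_roster.items()
    }


# Hypothetical roster; model names are placeholders.
default_roster = {
    "math_logic": ["model-a", "model-b"],
    "sources_evidence": ["model-b", "model-c"],
}
recused_roster = panel_without_author(default_roster, "model-b")
```

Building a new dict leaves the default roster untouched, so both configurations can be recorded side by side for the two runs.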
Repeatability Runs
- 4 re-runs: top scorer from each tier
- 1 bias comparison: same AI paper run with and without authoring model
- Total: 16 primary (12 human + 4 AI) + 4 repeatability + 1 bias = 21 runs
Success Criteria
- PRIMARY: Tier medians are monotonically ordered (T1 > T2 > T3 > T4)
- SECONDARY: Spearman correlation between tier number and overall score is negative and statistically significant
- EXPLORATORY: AI papers do not cluster at extremes relative to their quality tier
- TRANSPARENCY: Any result that fails these criteria is reported honestly with analysis
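The primary criterion can be checked mechanically. The sketch below assumes tiers are keyed by their number and that "monotonically ordered" means strictly decreasing medians, so a tie between adjacent tiers counts as a failure; that strictness is an interpretation of the T1 > T2 > T3 > T4 notation. The function name is hypothetical.

```python
from statistics import median


def medians_strictly_decreasing(scores_by_tier):
    """True if each tier's median is strictly above the next tier's median."""
    ordered = [median(scores_by_tier[tier]) for tier in sorted(scores_by_tier)]
    return all(a > b for a, b in zip(ordered, ordered[1:]))
```

For example, medians of 4.3, 3.0, 2.8, and 1.5 pass, while medians of 4.3, 2.8, 3.0, and 1.5 fail because Tier 3 outscores Tier 2.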
Timeline
- Week 1: Paper sourcing, synthetic paper creation, AI paper generation
- Week 2: All primary runs + repeatability runs
- Week 3: Analysis, report writing, publication
Cost Estimate
~25 runs × 3 per run = 75 total