Published March 16, 2026. Locked before any formal runs began.
See full results in White Paper v1.3 →

TOEShare Calibration Study — Preregistration v1

Purpose

Demonstrate that TOEShare's multi-agent AI review system produces meaningful, discriminating assessments across a spectrum of scientific quality.

Primary Hypothesis

Tier median overall scores will be monotonically decreasing: Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic)
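
The monotonicity check can be sketched as follows. This is a minimal illustration, not the study's analysis code: the function names and the dictionary layout are assumptions, and the scores are placeholders rather than results.

```python
from statistics import median


def tier_medians(scores_by_tier):
    """Return {tier: median overall score} for every tier that has scores."""
    return {tier: median(scores) for tier, scores in scores_by_tier.items() if scores}


def is_strictly_decreasing(medians, tier_order=(1, 2, 3, 4)):
    """True if the median drops strictly from each tier to the next one."""
    values = [medians[tier] for tier in tier_order]
    return all(a > b for a, b in zip(values, values[1:]))


# Placeholder scores for illustration only; these are NOT study results.
example_scores = {
    1: [4.2, 4.5, 4.0, 4.4],
    2: [3.1, 2.9],
    3: [2.8, 2.4, 3.0],
    4: [1.5, 2.0, 1.8],
}
medians = tier_medians(example_scores)
hypothesis_supported = is_strictly_decreasing(medians)
```

A strict comparison is used here, so tied medians between adjacent tiers count as a failure of the hypothesis.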

Secondary Hypothesis (Exploratory)

AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.

Methodology

Review Protocol

  • All papers submitted metadata-blind (no venue, credential, or publication status information provided to the review system)
  • Full multi-agent pipeline: 3 specialist roles (Math/Logic, Sources/Evidence, Science/Novelty) across multiple models per role, plus coordinator synthesis
  • Consensus rounds triggered when specialists disagree
  • Current production configuration used for all runs

Configuration Snapshot (locked per run)

Each run records:

  • Git commit hash
  • Prompt versions for all specialists and coordinator
  • Model roster (which models in which specialist slots)
  • Timestamp
  • Run treated as immutable for auditability
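
One way to hold such a snapshot is a frozen record. The sketch below is an assumption about layout, not the production schema: the class name, field names, role keys, model names, and commit hash are all hypothetical placeholders.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunSnapshot:
    """Per-run configuration record; frozen so fields cannot be reassigned."""
    git_commit: str
    prompt_versions: tuple  # ((component, version), ...)
    model_roster: tuple     # ((specialist role, (model, ...)), ...)
    timestamp: datetime


# Hypothetical values for illustration only.
snapshot = RunSnapshot(
    git_commit="0000000",
    prompt_versions=(
        ("math_logic", "v1"),
        ("sources_evidence", "v1"),
        ("science_novelty", "v1"),
        ("coordinator", "v1"),
    ),
    model_roster=(
        ("math_logic", ("model-a", "model-b")),
        ("sources_evidence", ("model-a", "model-c")),
        ("science_novelty", ("model-b", "model-c")),
    ),
    timestamp=datetime.now(timezone.utc),
)
```

Tuples are used instead of dicts and lists so that the nested values are immutable as well, not only the top-level fields.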

Terminology

  • "Metadata-blind": the system receives paper content but no external quality signals (venue, author credentials, citation counts, publication status)
  • NOT fully blind: the system can infer institutional context from content

Failure Handling

  • 1 automatic retry per failed specialist agent
  • 60-second timeout per agent call
  • If specialist fails after retry, run marked as "partial" with failure documented
  • Partial runs are valid data — they demonstrate system resilience

Repeatability

  • Minimum 1 re-run per tier (4 total)
  • Report score spread, not just point scores
  • Top scorer per tier selected for re-run
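
A minimal sketch of the re-run selection and spread report, assuming first-run scores are stored as `{paper_id: (tier, score)}`. The helper names and the paper IDs in the usage below are hypothetical, and the scores are placeholders rather than results.

```python
def top_scorer_per_tier(first_run_scores):
    """Map each tier to the paper with the highest first-run overall score.

    Ties go to the paper encountered first (an assumption; the
    preregistration does not specify a tie-break rule).
    """
    best = {}
    for paper_id, (tier, score) in first_run_scores.items():
        if tier not in best or score > first_run_scores[best[tier]][1]:
            best[tier] = paper_id
    return best


def score_spread(scores):
    """Summarize repeated runs of one paper by min, max, and range."""
    low, high = min(scores), max(scores)
    return {"min": low, "max": high, "range": round(high - low, 2)}


# Placeholder scores for illustration only; these are NOT study results.
first_run = {"P1": (1, 4.5), "P2": (1, 4.0), "P5": (2, 3.0), "P6": (2, 3.5)}
rerun_targets = top_scorer_per_tier(first_run)
spread = score_spread([4.5, 4.0])  # original run plus one re-run of P1
```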

Analysis Method

  • Tier medians + interquartile range (IQR)
  • Spearman rank correlation between tier assignment and overall score
  • Kruskal-Wallis test if sample supports it
  • Effect sizes reported alongside averages
  • Two claims kept separate:
    1. Tier discrimination (primary, must be supported by data)
    2. AI-vs-human comparison (exploratory, labeled as such)
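
The descriptive part of this analysis can be sketched with the standard library alone. The code below is an illustration under assumptions: the function names are hypothetical, the IQR uses the "inclusive" quantile method (the preregistration does not name one), and Spearman's rho is computed as the Pearson correlation of average ranks. It returns no p-values; for significance testing and the Kruskal-Wallis test, `scipy.stats.spearmanr` and `scipy.stats.kruskal` are one option.

```python
from math import sqrt
from statistics import mean, median, quantiles


def tier_summary(scores_by_tier):
    """Return {tier: {"median": ..., "iqr": ...}}; each tier needs >= 2 scores."""
    summary = {}
    for tier, scores in scores_by_tier.items():
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        summary[tier] = {"median": median(scores), "iqr": q3 - q1}
    return summary


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(tiers, scores):
    """Spearman rank correlation between tier assignment and overall score."""
    return pearson(average_ranks(tiers), average_ranks(scores))
```

Because Tier 1 is the highest-quality tier, support for the primary hypothesis would show up as a negative rho between tier number and score.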

Paper Pool

Pool A: Human-Written (12 papers)

Tier 1 — Gold Standard (4 papers) Published in top-tier peer-reviewed venues. Expected score range: 3.8-4.8/5

  • Paper 1: Google Quantum AI — Dynamic Surface Codes (Nature Physics, Oct 2025)
  • Paper 2: NOvA + T2K Joint Neutrino Oscillation Analysis (Nature, Oct 2025)
  • Paper 3: Emergent Photons in Quantum Spin Ice (Nature Physics, 2025)
  • Paper 4: Superconducting Qubit Material Improvements (Nature, Nov 2025)

Tier 2 — Mid-Range (2 papers) Competent work with notable limitations. Expected score range: 2.5-3.6/5

  • Paper 5: arXiv XOR framework paper (preprint, known math error)
  • Paper 6: Independent researcher paper (TBD — sourcing in progress)

Tier 3 — Framework Papers (3 papers) Ambitious unified theories from independent researchers. Expected score range: 2.0-3.8/5

  • Paper 7: GETT Foundation Stone — John Holland (pending permission)
  • Paper 8: GETT Hypothesis 3 — John Holland (pending permission)
  • Paper 9: QH Ghost Rank paper — Adam Murphy

Tier 4 — Synthetic Weak Papers (3 papers) Original papers written specifically for calibration, designed to contain specific failure modes. Expected score range: 1.0-2.5/5

  • Paper 10: Internal consistency failure (definitional drift)
  • Paper 11: Unfalsifiable framework (vague claims, no testable predictions)
  • Paper 12: Circular reasoning (dimensional analysis masquerading as derivation)

Pool B: AI-Generated Blind Papers (4 papers)

Each written by a different AI model, submitted without any indication of AI authorship. Expected score range: 2.5-4.0/5

  • Paper A1: Written by Claude — Hubble tension resolution
  • Paper A2: Written by ChatGPT — Black hole area theorem
  • Paper A3: Written by Gemini — Neutrino mass / dark energy connection
  • Paper A4: Written by Grok — Double-slit scalar field interpretation

AI Recusal Protocol

Default: the authoring model remains in the review panel (tests architecture robustness). One AI-generated paper is also run with its authoring model removed from the panel, for bias comparison.
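
A sketch of how the two panel configurations might be built, assuming the roster is a mapping from specialist role to a list of model names. The function name, role keys, and model names are hypothetical placeholders.

```python
def panel_for_run(roster, authoring_model=None, recuse=False):
    """Return a copy of the roster, dropping the authoring model when recuse=True."""
    if not recuse or authoring_model is None:
        return {role: list(models) for role, models in roster.items()}
    return {
        role: [model for model in models if model != authoring_model]
        for role, models in roster.items()
    }


# Hypothetical roster for illustration only.
roster = {
    "math_logic": ["model-a", "model-b"],
    "sources_evidence": ["model-a", "model-c"],
    "science_novelty": ["model-b", "model-c"],
}
default_panel = panel_for_run(roster, authoring_model="model-a")               # default run
recused_panel = panel_for_run(roster, authoring_model="model-a", recuse=True)  # bias comparison
```

Running the same paper through both panels gives the paired scores needed for the bias comparison.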

Repeatability Runs

  • 4 re-runs: top scorer from each tier
  • 1 bias comparison: same AI paper run with and without authoring model
  • Total: 16 primary + 4 AI + 4 repeatability + 1 bias = 25 runs

Success Criteria

  • PRIMARY: Tier medians are monotonically ordered (T1 > T2 > T3 > T4)
  • SECONDARY: Spearman correlation between tier and score is significant
  • EXPLORATORY: AI papers do not cluster at extremes relative to their quality tier
  • TRANSPARENCY: Any result that fails these criteria is reported honestly with analysis

Timeline

  • Week 1: Paper sourcing, synthetic paper creation, AI paper generation
  • Week 2: All primary runs + repeatability runs
  • Week 3: Analysis, report writing, publication

Cost Estimate

~25 runs × $3/run = ~$75 total