TOEShare Calibration Study
Public preregistration + transparent run-by-run reporting.
📊 Study in Progress
This page tracks the preregistered calibration protocol, run log, and versioned release history.
Active calibration version: v0.1-draft (2026-03-16)
Methodology (Preregistration)
TOEShare Calibration Study — Preregistration v1
Purpose
Demonstrate that TOEShare's multi-agent AI review system produces meaningful, discriminating assessments across a spectrum of scientific quality.
Primary Hypothesis
Tier median overall scores will be monotonically decreasing: Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic)
Secondary Hypothesis (Exploratory)
AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.
Methodology
Review Protocol
- All papers submitted metadata-blind (no venue, credential, or publication status information provided to the review system)
- Full multi-agent pipeline: 3 specialist roles (Math/Logic, Sources/Evidence, Science/Novelty) across multiple models per role, plus coordinator synthesis
- Consensus rounds triggered when specialists disagree
- Current production configuration used for all runs
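As a rough illustration (not the production implementation), the specialist/coordinator flow above can be sketched as follows. The three role names come from the protocol; the model identifiers, the disagreement threshold, and the mean-of-medians coordinator synthesis are placeholder assumptions:

```python
from statistics import median, mean

# Hypothetical roster mirroring the protocol's three specialist roles;
# the model names are placeholders, NOT the production configuration.
SPECIALIST_ROLES = {
    "math_logic": ["model-a", "model-b"],
    "sources_evidence": ["model-c", "model-d"],
    "science_novelty": ["model-e", "model-f"],
}

DISAGREEMENT_GAP = 1.0  # assumed score spread that triggers a consensus round

def review(paper_text, score_fn):
    """One metadata-blind pass: each model in each role scores the paper;
    a consensus round re-scores when a role's models disagree; the
    coordinator synthesis is modeled here as the mean of role medians."""
    role_medians = {}
    for role, models in SPECIALIST_ROLES.items():
        scores = [score_fn(m, role, paper_text) for m in models]
        if max(scores) - min(scores) > DISAGREEMENT_GAP:
            # Consensus round: specialists re-score after seeing disagreement.
            scores = [score_fn(m, role, paper_text) for m in models]
        role_medians[role] = median(scores)
    return mean(role_medians.values())
```

`score_fn` stands in for whatever model-call wrapper the production system uses; the point is only that every model in every slot scores the paper before synthesis.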
Configuration Snapshot (locked per run)
Each run records:
- Git commit hash
- Prompt versions for all specialists and coordinator
- Model roster (which models in which specialist slots)
- Timestamp
- Run treated as immutable for auditability
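A minimal sketch of such a locked snapshot, assuming hypothetical field names (the actual schema is not specified here). The content hash is an added illustration of one way immutability can be made auditable, not a documented feature:

```python
import hashlib
import json
from datetime import datetime, timezone

def snapshot_config(git_commit, prompt_versions, model_roster):
    """Record one run's locked configuration (hypothetical field names).

    A SHA-256 over the canonical JSON lets auditors detect any later
    mutation of the record, supporting the "immutable run" requirement.
    """
    record = {
        "git_commit": git_commit,
        "prompt_versions": prompt_versions,  # e.g. {"coordinator": "v2", ...}
        "model_roster": model_roster,        # specialist slot -> model list
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    canonical = json.dumps(record, sort_keys=True)
    record["config_hash"] = hashlib.sha256(canonical.encode()).hexdigest()
    return record
```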
Terminology
- "Metadata-blind": the system receives paper content but no external quality signals (venue, author credentials, citation counts, publication status)
- NOT fully blind: the system can infer institutional context from content
Failure Handling
- 1 automatic retry per failed specialist agent
- 60-second timeout per agent call
- If specialist fails after retry, run marked as "partial" with failure documented
- Partial runs are valid data — they demonstrate system resilience
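The retry/timeout rules above can be sketched like this; only the 60-second timeout and the single automatic retry come from the protocol, while `call_specialist` and its return shape are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

AGENT_TIMEOUT_S = 60  # per-agent-call timeout from the protocol
MAX_RETRIES = 1       # one automatic retry per failed specialist

def call_specialist(agent_fn, *args):
    """Apply the timeout/retry rules to one specialist call.

    Returns (result, None) on success, or (None, failure_note) if the
    specialist still fails after its retry; the caller then marks the
    run "partial" and documents the failure.
    """
    failure_note = None
    for attempt in range(1 + MAX_RETRIES):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(agent_fn, *args).result(timeout=AGENT_TIMEOUT_S), None
        except CallTimeout:
            failure_note = f"timeout on attempt {attempt + 1}"
        except Exception as exc:
            failure_note = f"{type(exc).__name__} on attempt {attempt + 1}"
        finally:
            pool.shutdown(wait=False)  # do not block on a hung call
    return None, failure_note
```

Note the `shutdown(wait=False)`: a timed-out agent call may still be running in its worker thread, so a fresh executor is used per attempt rather than queuing the retry behind a hung call.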
Repeatability
- Minimum 1 re-run per tier (4 total)
- Report score spread, not just point scores
- Top scorer per tier selected for re-run
Analysis Method
- Tier medians + interquartile range (IQR)
- Spearman rank correlation between tier assignment and overall score
- Kruskal-Wallis test across tiers if the sample sizes support it
- Effect sizes reported alongside averages
- Two claims kept separate:
  - Tier discrimination (primary, must be supported by data)
  - AI-vs-human comparison (exploratory, labeled as such)
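For concreteness, the planned summary statistics need no external dependency; the following is a minimal pure-Python sketch (a production analysis would more likely use scipy.stats), with the IQR taken as the difference of the sorted-half medians:

```python
from statistics import median

def iqr(xs):
    """Interquartile range via medians of the sorted halves (Tukey hinges)."""
    s = sorted(xs)
    mid = len(s) // 2
    lower, upper = s[:mid], (s[mid + 1:] if len(s) % 2 else s[mid:])
    return median(upper) - median(lower)

def _ranks(xs):
    """1-based average ranks; ties share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

With tier number as `x` and overall score as `y`, a strongly negative rho (scores fall as tier number rises) is the expected direction under the primary hypothesis.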
Paper Pool
Pool A: Human-Written (12 papers)
Tier 1 — Gold Standard (4 papers) Published in top-tier peer-reviewed venues. Expected score range: 3.8-4.8/5
- Paper 1: Google Quantum AI — Dynamic Surface Codes (Nature Physics, Oct 2025)
- Paper 2: NOvA + T2K Joint Neutrino Oscillation Analysis (Nature, Oct 2025)
- Paper 3: Emergent Photons in Quantum Spin Ice (Nature Physics, 2025)
- Paper 4: Superconducting Qubit Material Improvements (Nature, Nov 2025)
Tier 2 — Mid-Range (2 papers) Competent work with notable limitations. Expected score range: 2.5-3.6/5
- Paper 5: arXiv XOR framework paper (preprint, known math error)
- Paper 6: Independent researcher paper (TBD — sourcing in progress)
Tier 3 — Framework Papers (3 papers) Ambitious unified theories from independent researchers. Expected score range: 2.0-3.8/5
- Paper 7: GETT Foundation Stone — John Holland (pending permission)
- Paper 8: GETT Hypothesis 3 — John Holland (pending permission)
- Paper 9: QH Ghost Rank paper — Adam Murphy
Tier 4 — Synthetic Weak Papers (3 papers) Original papers written specifically for calibration, designed to contain specific failure modes. Expected score range: 1.0-2.5/5
- Paper 10: Internal consistency failure (definitional drift)
- Paper 11: Unfalsifiable framework (vague claims, no testable predictions)
- Paper 12: Circular reasoning (dimensional analysis masquerading as derivation)
Pool B: AI-Generated Blind Papers (4 papers)
Each written by a different AI model, submitted without any indication of AI authorship. Expected score range: 2.5-4.0/5
- Paper A1: Written by Claude — Hubble tension resolution
- Paper A2: Written by ChatGPT — Black hole area theorem
- Paper A3: Written by Gemini — Neutrino mass / dark energy connection
- Paper A4: Written by Grok — Double-slit scalar field interpretation
AI Recusal Protocol
Default: authoring model remains in the review panel (tests architecture robustness). One paper also run with authoring model removed for bias comparison.
Repeatability Runs
- 4 re-runs: top scorer from each tier
- 1 bias comparison: same AI paper run with and without authoring model
- Total: 16 primary (12 human + 4 AI) + 4 repeatability + 1 bias = 21 runs
Success Criteria
- PRIMARY: Tier medians are monotonically ordered (T1 > T2 > T3 > T4)
- SECONDARY: Spearman correlation between tier and score is significant
- EXPLORATORY: AI papers do not cluster at extremes relative to their quality tier
- TRANSPARENCY: Any result that fails these criteria is reported honestly with analysis
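The PRIMARY criterion is a strict monotonic ordering of tier medians, which is easy to state as an explicit check; the function name and input shape here are illustrative assumptions:

```python
from statistics import median

def primary_criterion(scores_by_tier):
    """PRIMARY check: tier medians strictly ordered T1 > T2 > T3 > T4.

    `scores_by_tier` maps tier number (1-4) to that tier's overall
    scores across runs; returns (passes, medians_in_tier_order).
    """
    medians = [median(scores_by_tier[t]) for t in sorted(scores_by_tier)]
    passes = all(hi > lo for hi, lo in zip(medians, medians[1:]))
    return passes, medians
```

A single inverted pair of tier medians fails the criterion, which per the transparency clause would be reported with analysis rather than suppressed.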
Timeline
- Week 1: Paper sourcing, synthetic paper creation, AI paper generation
- Week 2: All primary runs + repeatability runs
- Week 3: Analysis, report writing, publication
Cost Estimate
~25 runs × 75 total
Suggested Calibration Pool (with links)
paper-1 · Tier 1 · Google Quantum AI — Dynamic Surface Codes
Tests scoring on high-rigor institutional quantum error-correction work.
Link: https://www.nature.com/

paper-2 · Tier 1 · NOvA + T2K Joint Neutrino Oscillation Analysis
Tests large-collaboration evidence handling and statistical claims.
Link: https://www.nature.com/

paper-3 · Tier 1 · Emergent Photons in Quantum Spin Ice
Strong condensed-matter benchmark with experimental grounding.
Link: https://www.nature.com/

paper-4 · Tier 1 · Superconducting Qubit Material Improvements
Tests materials-centric quantum claims with practical constraints.
Link: https://www.nature.com/

paper-5 · Tier 2 · XOR framework preprint (known math error)
Mid-range benchmark where rhetoric exceeds mathematical correctness.
Link: https://arxiv.org/

paper-6 · Tier 2 · Independent Researcher Mid-Range Candidate
Controls for non-institutional but technically competent work.
Link: https://zenodo.org/

paper-7 · Tier 3 · GETT Foundation Stone — John Holland
Framework-style ambitious unification with explicit assumptions.
Link: internal://pending-permission/gett-foundation-stone

paper-8 · Tier 3 · GETT Hypothesis 3 — John Holland
Second framework from the same author, to test consistency across related claims.
Link: internal://pending-permission/gett-hypothesis-3

paper-9 · Tier 3 · QH Ghost Rank — Adam Murphy
Represents an ambitious independent framework with mixed strengths.
Link: internal://author-submission/qh-ghost-rank

paper-10 · Tier 4 · Synthetic Weak Paper — Internal consistency failure
Baseline failure case for definitional-drift detection.
Link: internal://synthetic/paper-10

paper-11 · Tier 4 · Synthetic Weak Paper — Unfalsifiable framework
Tests whether the system penalizes vague claims lacking testability.
Link: internal://synthetic/paper-11

paper-12 · Tier 4 · Synthetic Weak Paper — Circular reasoning
Checks resistance to superficial formalism masking an invalid derivation.
Link: internal://synthetic/paper-12

paper-a1 · Tier 2 · AI (Claude) — Hubble Tension Resolution
AI-authored blind sample for the exploratory human-vs-AI comparison.
Link: internal://ai-generated/paper-a1-claude-hubble-tension

paper-a2 · Tier 2 · AI (ChatGPT) — Black Hole Area Theorem
AI blind sample with solid structure but limited novelty.
Link: internal://ai-generated/paper-a2-chatgpt-black-hole-area-theorem.md

paper-a3 · Tier 2 · AI (Gemini) — Quantum-Centric Supercomputing: QPU-GPU Architectures
Gemini-generated blind sample. Scored ~2.14 average on its first run; reviewers caught math inconsistencies, temporal confusion, and missing derivations. Novelty was the highest dimension (3/5).
Link: internal://ai-generated/paper-a3-gemini-quantum-supercomputing

paper-a4 · Tier 2 · AI (Grok) — Double-Slit Scalar Field Interpretation
AI blind sample with explicit caveats and a mixed novelty signal.
Link: internal://ai-generated/paper-a4-grok-double-slit-scalar-field.md

paper-a5-gpt · Tier 2 · AI (GPT) — Quantum-Centric Supercomputing: Architectures, Tensor-Network Surrogates, and Hybrid Paths to Utility
GPT-generated blind sample on the same QCSC topic as Gemini's paper-a3. Scored 3.57 average vs. Gemini's 2.14, demonstrating quality discrimination within a single topic; its perspective/synthesis framing earned higher clarity and completeness scores.
Link: internal://ai-generated/paper-a5-gpt-quantum-supercomputing

paper-a6-claude-sonnet · Tier 2 · AI (Claude Sonnet 4.6) — QCSC: Architectural Convergence QPU-GPU, Tensor Network Co-Processing, AI Error Mitigation in Post-NISQ Era
CORRECTION: authored by Claude Sonnet 4.6 (not Opus). Highest scorer so far at 4.0 average and the only paper to receive 5/5 on Clarity. Reviewers flagged four foundational departures (hardware claims, timeline projections). Falsifiability was weakest at 3/5.
Link: internal://ai-generated/paper-a6-claude-sonnet-qcsc-post-nisq

paper-a7-grok-qcsc · Tier 2 · AI (Grok) — QCSC: From Classical Emulation to Integrated QPU-GPU Architectures Enabling Utility-Scale Quantum Simulation
Grok (xAI) blind sample on the same QCSC prompt. Scored 2.71 average, second lowest after Gemini. Like Gemini, Grok has zero overlap with the reviewer panel. Math was marked down for malformed tensor-train expressions and boundary-condition errors; Novelty was weakest at 2/5 (characterized as a technology survey). Strengthens the panel-overlap bias correlation finding.
Link: internal://ai-generated/paper-a7-grok-qcsc-utility-scale

paper-a7-claude-opus · Tier 2 · AI (Claude Opus 4) — Quantum-Classical Advantage Boundaries: An Analytical Framework for Hybrid QPU-GPU Computational Utility
Claude Opus 4 on a DIFFERENT topic (QCAB) than the shared QCSC prompt, testing whether Opus produces higher quality than Sonnet. Scored 3.33 average, lower than Sonnet's 4.0 on QCSC. Math errors caught: algebraic errors in the critical qubit-count derivation, an arithmetic mistake (56,000s → 56s), and overlapping regime definitions. Falsifiability was strong at 4/5. gpt-5-nano failed (invalid JSON); a consensus round was triggered.
Link: internal://ai-generated/paper-a7-claude-opus-qcab

Run Log
run-001 · complete · 3/18/2026, 6:36:05 PM
Suite: manual-testing
Avg score: 2.14 · Recommendation: revise
run-002 · complete · 3/18/2026, 7:23:34 PM
Suite: manual-testing
Avg score: 3.57 · Recommendation: publish
run-003 · complete · 3/18/2026, 10:27:58 PM
Suite: manual-testing
Avg score: 4.0 · Recommendation: revisions_suggested
run-004 · complete · 3/18/2026, 6:56:34 PM
Suite: manual-testing
Avg score: 2.71 · Recommendation: revisions_suggested
run-005 · complete · 3/19/2026, 5:56:00 PM
Suite: manual-testing
Avg score: 3.33 · Recommendation: revisions_suggested
Results
Version History
v0.1-draft · active · 2026-03-16
Initial calibration scaffolding and preregistration publication.
Suite: Calibration Preregistration Baseline · Papers used: 0