May 2026: Calibration Complete, Agent API Live, and What's Next

2026-05-08 - Adam Murphy

Our 4-phase calibration study is complete. The agent API is live. Mathematical Risk Flags ship with every review. And we found the same error a physicist published a formal rebuttal about. Here's everything that shipped this month.

This is a big one. In the last two weeks, we completed our entire calibration study, shipped agent-native workflows, added a fifth AI provider, and our system independently found the same mathematical error a professional physicist published a formal rebuttal about.

Here's what happened and why it matters for your work on the platform.


The Calibration Study Is Complete

We preregistered a 4-phase calibration study to answer a simple question: can you trust the scores?

All four phases are now done.

Phase 1: Tier Discrimination

We scored 22 human papers and 5 AI-generated papers across four quality tiers. Result: monotonic ordering confirmed. Better papers consistently score higher. Spearman correlation: ρ = -0.79, p = 0.001. Cohen's d = 4.61 (massive effect size).
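
For readers who want to re-run the arithmetic behind these two statistics, here is a minimal Python sketch using placeholder tier labels and scores (not the study data, which is published at /calibration):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: tier labels and panel scores for eight hypothetical papers.
# In this toy coding, a lower tier number means a better paper, which is why the
# correlation with score comes out negative, matching the sign reported above.
tiers  = np.array([1, 1, 2, 2, 3, 3, 4, 4])
scores = np.array([4.4, 4.2, 3.6, 3.5, 2.9, 2.7, 1.8, 1.6])

rho, p = spearmanr(tiers, scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")

# Cohen's d between the top and bottom tiers, using the pooled standard deviation.
top, bottom = scores[tiers == 1], scores[tiers == 4]
pooled_sd = np.sqrt((top.var(ddof=1) + bottom.var(ddof=1)) / 2)
d = (top.mean() - bottom.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```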

Phase 2: Repeatability

We submitted one paper four times under identical conditions. Maximum spread across all dimensions: 1 point. Four of six dimensions had zero variance. Recommendation was unanimous across all runs. The system gives the same answer when you ask the same question.

Phase 3: Prompt Version Delta

We re-scored papers from three different quality tiers under our latest prompts (v2.2). The rank order never inverted — better papers always scored higher. The new prompts are stricter at the top (the best papers dropped ~0.9 points) but the floor is stable (low-quality papers barely moved). This means v2.2 is harder to impress, not randomly different.

Phase 4: Bias Isolation

We tested whether AI models score their own "family's" work higher. The answer: no. What exists is a rubric-interpretation gap — GPT models grade ~1.4 points stricter than Claude models across ALL papers, regardless of authorship. This isn't bias; it's different models interpreting "mathematical rigor" differently. Our panel's median-anchoring system compresses these extremes into stable final scores.
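
The mechanics of that median anchoring are simple. As a minimal sketch with hypothetical panel scores (not real review output), a constant strictness offset on one model family shifts the median far less than it would shift a mean:

```python
import statistics

# Hypothetical panel scores for one paper on one dimension.
# The GPT-family reviewers are assumed ~1.4 points stricter, as described above.
panel_scores = {
    "gpt-reviewer-a":    2.8,
    "gpt-reviewer-b":    3.0,
    "deepseek-reviewer": 3.9,
    "claude-reviewer-a": 4.2,
    "claude-reviewer-b": 4.4,
}

# Median anchoring: the final score sits at the middle of the panel, so a
# uniformly strict (or lenient) model family pulls it less than averaging would.
print("median:", statistics.median(panel_scores.values()))            # 3.9
print("mean:  ", round(statistics.mean(panel_scores.values()), 2))    # 3.66
```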

Bottom line: The scores are meaningful, repeatable, and not biased by model family. The full methodology and data are public at /calibration.


Our AI Found the Same Error a Physicist Published a Rebuttal About

A paper published in the Proceedings of the Royal Society A — by researchers at MIT — claimed to derive quantum wave functions exactly from classical mechanics. If true, it would overturn a century of physics.

Our system scored it 2 out of 5.

The Math/Logic specialists identified that the derivation omits a critical term: the quantum potential. Without it, the claimed "exact" equivalence is just the standard semiclassical approximation known since the 1920s.
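
For context, in the standard Madelung/Bohm decomposition ψ = R e^{iS/ℏ} (the textbook form, not necessarily the paper's notation), the Schrödinger equation yields a Hamilton–Jacobi equation that carries exactly one extra term:

$$
\frac{\partial S}{\partial t} + \frac{(\nabla S)^2}{2m} + V \;\underbrace{-\;\frac{\hbar^2}{2m}\,\frac{\nabla^2 R}{R}}_{\text{quantum potential } Q} \;=\; 0
$$

Set Q to zero and you recover the classical Hamilton–Jacobi equation, which is precisely the semiclassical approximation the review pointed to.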

Three days earlier, a physicist at Eötvös Loránd University in Budapest had posted a formal comment on arXiv identifying the exact same error, in the exact same terms. We didn't know about the comment when we ran the review. The system found it independently.

We then submitted the rebuttal through the same system. It scored 4/5 — Approved.

This is the strongest validation datapoint we have: the system discriminates between a paper with a foundational mathematical error and a paper that correctly identifies that error. Prestige doesn't protect bad math.

Read the full story →


Mathematical Risk Flags

Every review now includes Mathematical Risk Flags — specific, equation-level warnings when our math specialists identify steps that are stated without derivation, rely on compressed reasoning, or could invalidate downstream results if wrong.

These are not score adjustments. They're visible amber badges on your review profile that point to specific equations and say: "this step was flagged — here's why, and here's what fails if it's invalid."
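
Purely as an illustration of what a single flag conveys (the field names below are invented for this example, not the API's real schema), the information looks roughly like this:

```python
# Hypothetical example of the information a Mathematical Risk Flag carries.
# Field names are illustrative only; see your review payload for the real structure.
risk_flag = {
    "severity": "high",
    "location": "Eq. (14)",
    "issue": "Step stated without derivation; the sign of the boundary term is not justified",
    "downstream_impact": "If the sign is wrong, the stability result in Section 5 no longer follows",
}
```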

For the Susskind paper we tested (a conceptual preprint from Stanford), the system generated 9 high-severity flags despite the paper being by one of the most famous physicists alive. For the Google Quantum AI Nature paper, it flagged a specific variable (Λ) used inconsistently across two definitions.

Risk Flags give you something journals don't: a specific map of where your math is strongest and where it needs attention.


Agent-Native Workflows & MCP API

You can now interact with TOEShare programmatically. The MCP (Model Context Protocol) API lets you:

  • Create paper drafts from markdown content
  • Submit for review and poll for completion
  • Retrieve full review feedback including scores, flags, and specialist reports
  • Search across your own papers and published community work
  • Publish papers when they meet the threshold

This means you can wire TOEShare into your existing research workflow — Cursor, Claude Desktop, or any MCP-compatible tool. Create a draft, submit it, wait for the review, read the feedback, iterate, and resubmit — all without leaving your IDE.
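
As a minimal sketch of that loop from Python using the MCP client SDK (the server command and tool names below are illustrative, not the real ones; see the API documentation for those):

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Illustrative server launch; swap in the command and the Agent Access token from your dashboard.
server = StdioServerParameters(
    command="npx",
    args=["-y", "toeshare-mcp"],
    env={"TOESHARE_AGENT_TOKEN": "<token from your dashboard>"},
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the draft / submit / review tools the server exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # e.g. create a draft from a local markdown file (hypothetical tool name and arguments).
            result = await session.call_tool(
                "create_draft",
                arguments={"markdown": open("paper.md").read()},
            )
            print(result.content)

asyncio.run(main())
```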

Agent Access tokens are available from your dashboard. The API documentation includes endpoint specs, Cursor IDE configuration, and example workflows.


Together AI: US-Hosted DeepSeek Models

We added Together AI as our fifth provider, giving access to DeepSeek V3.1, R1, and V4 Pro models. These run on US-based, SOC 2 compliant infrastructure with no data retention, so no data leaves US servers.

This matters for two reasons: DeepSeek models bring a different reasoning style to the panel (more diversity = better consensus), and US-hosted infrastructure means we can offer them without the data sovereignty concerns that come with direct DeepSeek API access.


Conceptual Track Publish Gate

Previously, papers that scored below the publication threshold were automatically visible on public pages with a "conceptual" badge. That wasn't right — authors should decide when their in-progress work goes public.

Now, below-threshold work stays private until the author explicitly clicks "Publish to Conceptual Track." This gives you control over when your work-in-progress is visible, while still providing the full review, improvement roadmap, and iteration path.


Featured Authors

We're proud to feature our first two authors on the platform — both with video walkthroughs where they explain their work in their own words.

Blake Shatto: Mode Identity Theory. A framework built from a single topological postulate that claims to derive the cosmological constant, fermion masses, and more across 122 orders of magnitude. Scored a 4.3/5 on the platform. Blake is also a beta tester whose testimonial appears on our homepage; the system caught a specific wording issue in his paper that would have made it mathematically invalid while it was already with editors at a math journal.

John Holland: General Expanse Tension Theory. A Chartered Engineer from the UK with 45 papers on Zenodo, extending the Standard Model with his Gauge-Invariant Singlet Scalar Field. John brought decades of independent research to the platform and has been an active evangelist for what we're building.

Both have video walkthroughs on our Videos page. Real researchers. Real frameworks. Real review.


More Updates

  • Endorsement Packet PDF — Download a 3-page branded review certificate with dimensional scores, specialist reports, and evidence. Send it to arXiv endorsers, journal editors, or collaborators.
  • Social Sharing — LinkedIn, X, and Facebook sharing with dynamic preview images showing your title and score.
  • Copy-to-Clipboard — Copy AI chat messages, challenge results, specialist findings, and review assessments with one click.
  • YouTube Integration — Featured videos section on the homepage. Dedicated /videos page.
  • UI Cleanup — Fixed duplicate error banners, standardized button naming, new Callout component replacing 12 copy-pasted patterns, keyboard-accessible upload drop zones.

What's Next

We're turning our attention to marketing and outreach. The product works. The calibration proves it. Now we need to make sure the people who need this — independent researchers, framework builders, anyone locked out of traditional peer review — know it exists.

If you're on the platform, the best thing you can do is submit your work and share the results. Every review profile is shareable. Every endorsement packet is downloadable. The more researchers who see structured, transparent, multi-agent review in action, the faster this becomes the standard.

Questions? Ideas? Hit the feedback button in your dashboard or email adam@theoryofeverything.ai.


TOEShare uses independent specialist AI agents from multiple providers with coordinator synthesis to produce structured, paradigm-neutral review of scientific work. Learn more at /about or see the full calibration methodology at /calibration.