Preprint · verifiable-run harness

Quantum design, machine-checked.

A prompt harness that grades a model's quantum design with a hermetic, deterministic judge.

Abstract

We point a capable model at a hard quantum design problem and grade its answer with a hermetic, deterministic judge — four gates, no human in the scoring loop, every claim reproducible on a laptop (numpy only, no QPU). A submission is re-derived from scratch and rejected if it violates its constraints, fabricates a result, underperforms, or overfits a held-out check. The figures below run the real bench math in your browser and are interactive.

4 active gates
5 task types
26/26 judge suite
82/82 measurement
numpy only · offline

Why this exists

A public workbench for verified quantum design

QuantumMytheme is a place to do your own harness-preparation work: mint a public run, point your own model at a problem, and produce a result anyone can re-check. Three reasons it's worth your time.

01

Contribute to science

Every accepted run adds to an open, reproducible, re-verifiable corpus of quantum designs. Correctness is scored without human taste — the simulator recomputes the number, so a result holds up or it doesn't.

02

A scoreboard across paradigms

The same hidden-graded problems let you compare design approaches head to head: which ansatz, which topology, and which feature map currently leads, and how each beats the classical baseline. The frontier is public.

03

For the curious, to hill-climb

Point a capable model at a BRIEF and watch it loop to ACCEPT — then try to beat the best verified score with fewer gates, a sparser map, or a simpler feature map. Hill-climb on a number a machine checks for you.

The harness is model-agnostic; models are just fuel. The judge doesn't care who, or what, produced a bundle; it only re-simulates. Drive a run with Opus 4.8 or Fable 5 today; the harness is built to be ready for the next-generation models you may know as Mythos / Lythos.

The judge

Four gates, first failure wins

Every submission passes through four gates in order; the first gate it fails sets the exit code and the verdict. A submission that clears all four is ACCEPTed with exit 0.

Re-simulated, not trusted:

  • GATE 01 · Structure · exit 3
  • GATE 02 · Reproducibility · exit 4
  • GATE 03 · Performance · exit 5
  • GATE 04 · Anti-overfit · exit 6

The anti-overfit gate fires only when a problem holds out a check the model never saw: an observable, a workload, or a test set.

Figure 1. A proof bundle is re-derived from scratch and passed through the four gates; the first failure decides the exit code.
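The first-failure-wins rule can be sketched as a short pipeline. This is a minimal illustration under stated assumptions, not the harness's own code: the `Gate` type, the `judge` function, and the dictionary-shaped bundle with its field names are all hypothetical. Only the gate names, their order, and the exit codes (3, 4, 5, 6, with 0 for ACCEPT) come from the text above.

```python
from dataclasses import dataclass
from typing import Callable

Bundle = dict  # hypothetical stand-in for a proof bundle


@dataclass(frozen=True)
class Gate:
    name: str
    exit_code: int
    check: Callable[[Bundle], bool]                    # True when the bundle passes
    applies: Callable[[Bundle], bool] = lambda b: True  # gate may be skipped


def judge(bundle, gates):
    """Run the gates in order; the first failing gate decides the exit code."""
    for gate in gates:
        if gate.applies(bundle) and not gate.check(bundle):
            return gate.exit_code, f"REJECT at {gate.name}"
    return 0, "ACCEPT"


# Hypothetical checks; the real judge re-simulates instead of reading fields.
GATES = [
    Gate("Structure", 3, lambda b: b["gate_count"] <= b["max_gates"]),
    Gate("Reproducibility", 4, lambda b: abs(b["claimed"] - b["recomputed"]) < 1e-9),
    Gate("Performance", 5, lambda b: b["recomputed"] >= b["threshold"]),
    Gate(
        "Anti-overfit", 6,
        lambda b: b["held_out"] >= b["held_out_threshold"],
        applies=lambda b: "held_out" in b,  # fires only if a check is held out
    ),
]

good = {"gate_count": 3, "max_gates": 4, "claimed": 1.0,
        "recomputed": 1.0, "threshold": 0.99}
fabricated = {**good, "recomputed": 0.5}  # claim does not match the re-simulation
overfit = {**good, "held_out": 0.5, "held_out_threshold": 0.9}

for label, bundle in [("good", good), ("fabricated", fabricated), ("overfit", overfit)]:
    print(label, judge(bundle, GATES))
```

The `fabricated` bundle would also fail the performance check, but it exits with 4 because the reproducibility gate comes first.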

State preparation

Entanglement leaves the surface

This figure steps through the GHZ circuit one gate at a time. The sphere shows qubit 0; the bars show the full state, with colour encoding the complex phase of each amplitude. When the entangling gate fires, qubit 0's vector retreats from the surface to the centre, the signature of a maximally entangled state.

Figure 2. Qubit 0 of the GHZ circuit on the Bloch sphere; the amplitude bars are coloured by phase (hue) and amplitude (brightness).
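The retreat to the centre can be checked numerically. The sketch below assumes a three-qubit GHZ circuit built as a Hadamard on qubit 0 followed by a chain of CNOTs, with qubit 0 as the most significant bit; the helper names (`embed`, `single`, `cnot`, `bloch_q0`, `ghz_trace`) are illustrative and not taken from the harness.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
P0 = np.array([[1, 0], [0, 0]], dtype=complex)  # projector onto |0>
P1 = np.array([[0, 0], [0, 1]], dtype=complex)  # projector onto |1>


def embed(ops):
    """Kronecker product of one 2x2 operator per qubit (qubit 0 first)."""
    out = np.eye(1, dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out


def single(gate, qubit, n):
    ops = [I2] * n
    ops[qubit] = gate
    return embed(ops)


def cnot(control, target, n):
    keep = [I2] * n
    keep[control] = P0
    flip = [I2] * n
    flip[control] = P1
    flip[target] = X
    return embed(keep) + embed(flip)


def bloch_q0(state, n):
    """Bloch vector (<X>, <Y>, <Z>) of qubit 0."""
    return np.array([np.vdot(state, single(P, 0, n) @ state).real for P in (X, Y, Z)])


def ghz_trace(n=3):
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    steps = [("H q0", single(H, 0, n))]
    steps += [(f"CNOT q{k}->q{k + 1}", cnot(k, k + 1, n)) for k in range(n - 1)]
    trace = []
    for label, gate in steps:
        state = gate @ state
        trace.append((label, bloch_q0(state, n)))
    return state, trace


state, trace = ghz_trace(3)
for label, vec in trace:
    print(f"{label:14s} |r| = {np.linalg.norm(vec):.3f}")
```

After the Hadamard the Bloch vector has length 1 (on the surface); after the first CNOT its length drops to 0 and stays there, because the reduced state of qubit 0 is maximally mixed.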

Architecture — topology

A topology that generalizes, or overfits

Routing a workload of qubit interactions across a coupling map. A ring routes both the visible and the held-out workload within budget; a linear path tuned to the visible pairs exceeds the held-out budget and is rejected at the anti-overfit gate.

With Path selected and the held-out workload applied, the topology that aced the visible pairs now routes [0–3] the long way and exceeds the budget.

Figure 3. Routing cost is the summed shortest-path distance over the required interactions; the held-out workload is what the anti-overfit gate checks.
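The routing cost described in the caption, a sum of shortest-path distances, can be sketched with a breadth-first search. The four-node ring and path, the two workloads, and the budget of 2 below are hypothetical values chosen for illustration; they are not the harness's hidden answer key.

```python
from collections import deque


def distances_from(edges, n, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    adjacency = {v: [] for v in range(n)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def routing_cost(edges, n, workload):
    """Summed shortest-path distance over the required interactions."""
    return sum(distances_from(edges, n, a)[b] for a, b in workload)


# Hypothetical instance: 4 qubits, two visible and two held-out interactions.
RING = [(0, 1), (1, 2), (2, 3), (3, 0)]
PATH = [(0, 1), (1, 2), (2, 3)]
VISIBLE = [(0, 1), (1, 2)]
HELD_OUT = [(0, 3), (2, 3)]
BUDGET = 2

for name, edges in [("ring", RING), ("path", PATH)]:
    visible = routing_cost(edges, 4, VISIBLE)
    held_out = routing_cost(edges, 4, HELD_OUT)
    verdict = "within budget" if max(visible, held_out) <= BUDGET else "over budget"
    print(f"{name}: visible={visible} held_out={held_out} -> {verdict}")
```

In this toy instance both topologies cost 2 on the visible pairs, but the path must route the pair (0, 3) across three hops, so its held-out cost rises to 4 while the ring stays at 2.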

Quantum machine learning

The overfit, in one parameter

A quantum feature map Ry(scale·x) labels points by the sign of ⟨X⟩ = sin(scale·x). Filled dots are training data; ringed dots are the held-out test set. Raise scale, which sets the frequency of the decision curve: the curve still threads every training point, but the test points are misclassified, which is what the held-out gate catches.

Figure 4. Decision curve of the feature map; held-out test accuracy below threshold triggers the anti-overfit gate (exit 6).
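The overfit can be reproduced in a few lines. The sketch below simulates Ry(scale·x) on |0⟩, measures ⟨X⟩, and compares train and held-out accuracy at two scales. The data points, the high scale of 5.0, and the helper names are hypothetical and picked so the effect is visible; they are not the harness's data.

```python
import numpy as np

PAULI_X = np.array([[0.0, 1.0], [1.0, 0.0]])


def ry(theta):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def expect_x(x, scale):
    """<X> after Ry(scale * x) acts on |0>; analytically sin(scale * x)."""
    state = ry(scale * x) @ np.array([1.0, 0.0])
    return float(state @ PAULI_X @ state)


def accuracy(xs, labels, scale):
    predictions = np.sign([expect_x(x, scale) for x in xs])
    return float(np.mean(predictions == labels))


# Hypothetical data: the true labels follow the low-frequency rule sign(sin(x)).
TRAIN_X = np.array([-1.4, -0.3, 0.3, 1.4])
TEST_X = np.array([-1.0, -0.8, 0.8, 1.0])
TRAIN_Y = np.sign(np.sin(TRAIN_X))
TEST_Y = np.sign(np.sin(TEST_X))

for scale in (1.0, 5.0):
    train_acc = accuracy(TRAIN_X, TRAIN_Y, scale)
    test_acc = accuracy(TEST_X, TEST_Y, scale)
    print(f"scale={scale}: train={train_acc:.2f} held-out={test_acc:.2f}")
```

With these points, scale 5.0 still classifies every training point correctly, yet sin(5x) has flipped sign at all four held-out points, so the held-out accuracy collapses while the training accuracy gives no warning.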

Scoreboard · the current frontier

Seeded leaderboard — the baselines to beat

A per-problem leaderboard of judge-ACCEPTED designs, ranked by the verified metric. Seeded with the harness's reference baselines; every number is the judge's own and re-verifiable (scoreboard/verify.py → 5/5 exit 0). No score here is self-reported.

| Problem · task | Paradigm | Verified metric | Cost | Model | Proof |
|---|---|---|---|---|---|
| ghz3 · state_prep | chain-cascade | fidelity 1.000 (≥ 0.99 · base 0.5) | 2q 2 · depth 3 | reference-baseline | bundle ↗ |
| isingbell2 · vqe | minimal-bell-ansatz | gap 0.000 to E₀=−2 (budget 0.05 · base −1) | 2q 1 · depth 2 | reference-baseline | bundle ↗ |
| bell_pops2 · populations | phase-correct-bell | ⟨X₀X₁⟩ +1.00 (held-out · pops dev 0) | 2q 1 · depth 2 | reference-baseline | bundle ↗ |
| aiaccel4 · architecture | ring | routing 2 (budget 2 · base 4 · held-out 2) | edges 4 · deg 2 | reference-baseline | bundle ↗ |
| qml_sign1 · classify | low-frequency-encoding | test 100% (held-out · train 100%) | ops 1 · 1 qubit | reference-baseline | bundle ↗ |

Table 1. Seeded leaderboard. Model = reference-baseline — hand-authored worked examples, not a model run; the bar to beat. A real run names the model it pointed at the BRIEF and links its own public run repo. Model is provenance, never a ranking key — the judge re-simulates regardless of author.

Why each leads

  • ghz3 — perfect fidelity at the minimal cost for GHZ on a line; nothing cheaper reaches the target.
  • isingbell2 — the exact ground state (gap 0) at depth 2; entangling beats the product-state baseline.
  • bell_pops2 — matches the visible populations and the hidden ⟨X₀X₁⟩=+1; clears the gate the impostor fails.
  • aiaccel4 — a ring routes the visible and held-out workloads within budget; the overfit path is rejected.
  • qml_sign1 — 100% train and held-out test with one rotation; a high-frequency map overfits and can't qualify.
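The bell_pops2 entry turns on a point that is easy to verify: two states can share identical populations and still differ on a held-out observable. The sketch below assumes the impostor is the Bell state with the opposite relative phase, (|00⟩ − |11⟩)/√2; that choice is an assumption made for illustration, not something the harness specifies.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
XX = np.kron(X, X)  # the two-qubit observable X0 X1

# Amplitudes ordered |00>, |01>, |10>, |11>.
phase_correct = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
impostor = np.array([1.0, 0.0, 0.0, -1.0]) / np.sqrt(2)  # assumed impostor state


def populations(state):
    """Computational-basis measurement probabilities."""
    return np.abs(state) ** 2


def expectation(operator, state):
    return float(np.real(np.vdot(state, operator @ state)))


print("populations equal:", np.allclose(populations(phase_correct), populations(impostor)))
print("<X0X1> phase-correct:", expectation(XX, phase_correct))
print("<X0X1> impostor:     ", expectation(XX, impostor))
```

Both states put probability 0.5 on |00⟩ and |11⟩, so a populations-only check cannot separate them; ⟨X₀X₁⟩ is +1 for the phase-correct state and −1 for the assumed impostor, which is the kind of difference a held-out observable exposes.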

Do your own run

Bring your own model. Take rank 1.

Each run lives in its own public repo. You bring the model — your Claude subscription, or API / token credits. The judge never holds your credits; it only structures the run and verifies the output.

  1. Mint a run repo from the template — a fresh public repo in the org (or bin/new-run.sh).
  2. Pick or write a BRIEF — the problem stated conceptually; the answer key stays host-side with the judge.
  3. Run the kickoff with your model — it self-corrects against the rubric until the judge ACCEPTs.
  4. Commit the result back — the proof bundle, the verdict (exit 0), the scrubbed transcript, the scorecard; push, and register a scoreboard row.

The run repo is the permanent, public, re-verifiable record. Anyone re-runs the judge on your committed bundle and gets the same verdict — that's the whole contract.