Preprint · verifiable-run harness
A prompt harness that grades a model's quantum design with a hermetic, deterministic judge.
We point a capable model at a hard quantum design problem and grade its answer with a hermetic, deterministic judge: four gates, no human in the scoring loop, every claim reproducible on a laptop (numpy only, no QPU). A submission is re-derived from scratch and rejected if it violates its constraints, fabricates a result, underperforms, or fails a held-out check because it overfit the visible one. The figures below run the real bench math in your browser and are interactive.
Why this exists
QuantumMytheme is a place to do your own harness-preparation work: mint a public run, point your own model at a problem, and produce a result anyone can re-check. Three reasons it's worth your time.
Every accepted run adds to an open, reproducible, re-verifiable corpus of quantum designs. Correctness is scored without human taste — the simulator recomputes the number, so a result holds up or it doesn't.
The same hidden-graded problems let you compare design approaches head to head: which ansatz, which topology, or which feature map currently leads, and how each beats the classical baseline. The frontier is public.
Point a capable model at a BRIEF and watch it loop to ACCEPT, then try to beat the best verified score with fewer gates, a sparser coupling map, or a simpler feature map. Hill-climb on a number a machine checks for you.
The harness is model-agnostic; models are interchangeable fuel. The judge doesn't care who, or what, produced a bundle; it only re-simulates. Drive a run with Opus 4.8 or Fable 5 today; the harness is built to be ready for the next-gen models you may know as Mythos.
The judge
Every submission passes through four gates in order; the first it fails sets the exit code and the verdict. Choose a submission and run it.
Figure 1. A proof bundle is re-derived from scratch and passed through the four gates; the first failure decides the exit code.
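The gate ordering can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the `Recomputed` record, its field names, and the exit codes 3, 4, and 5 are assumptions made for the example. Only exit 0 (ACCEPT) and exit 6 (anti-overfit) appear in this document.

```python
from dataclasses import dataclass
from typing import Tuple

# Exit codes: 0 and 6 come from the text; 3, 4 and 5 are placeholders (assumed).
EXIT_ACCEPT = 0
EXIT_CONSTRAINT = 3      # assumed value
EXIT_FABRICATED = 4      # assumed value
EXIT_UNDERPERFORMS = 5   # assumed value
EXIT_OVERFIT = 6


@dataclass
class Recomputed:
    """Values a judge would re-derive from a bundle (hypothetical field names)."""
    within_constraints: bool
    claimed_score: float
    recomputed_score: float
    threshold: float
    held_out_score: float
    held_out_threshold: float


def judge(r: Recomputed, tol: float = 1e-9) -> Tuple[int, str]:
    """Run the four gates in order; the first failing gate decides the verdict."""
    gates = [
        (EXIT_CONSTRAINT, "constraints", lambda: r.within_constraints),
        (EXIT_FABRICATED, "no-fabrication",
         lambda: abs(r.claimed_score - r.recomputed_score) <= tol),
        (EXIT_UNDERPERFORMS, "performance",
         lambda: r.recomputed_score >= r.threshold),
        (EXIT_OVERFIT, "anti-overfit",
         lambda: r.held_out_score >= r.held_out_threshold),
    ]
    for code, name, check in gates:
        if not check():
            return code, f"REJECT at {name}"
    return EXIT_ACCEPT, "ACCEPT"


# A clean bundle, and one that both misreports its score and fails held-out.
good = Recomputed(True, 1.0, 1.0, 0.99, 1.0, 0.9)
bad = Recomputed(True, 1.0, 0.8, 0.99, 0.5, 0.9)
print(judge(good))
print(judge(bad))
```

The second bundle fails three gates, but only the first failure (the fabrication check) is reported, which is the ordering behaviour described above.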
State preparation
Stepping the GHZ circuit. The sphere shows qubit 0; the bars show the full state, with colour encoding the complex phase of each amplitude. When the entangling gate fires, qubit 0's vector retreats from the surface to the centre — the signature of a maximally entangled state.
Figure 2. Qubit 0 of the GHZ circuit on the Bloch sphere; the amplitude bars are coloured by phase (hue) and amplitude (brightness).
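The collapse of qubit 0's Bloch vector can be checked with plain numpy. The sketch below assumes the 3-qubit chain-cascade circuit (H on qubit 0, then CNOT 0→1, then CNOT 1→2) and a convention where qubit 0 is the most significant bit; helper names such as `cnot` and `bloch_qubit0` are illustrative and not part of the harness.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
P0 = np.array([[1, 0], [0, 0]], dtype=complex)  # |0><0|
P1 = np.array([[0, 0], [0, 1]], dtype=complex)  # |1><1|


def kron(*ops):
    """Tensor product of operators, leftmost factor acting on qubit 0."""
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out


def cnot(control, target, n=3):
    """CNOT as P0 (x) I + P1 (x) X on the chosen qubits."""
    a, b = [I2] * n, [I2] * n
    a[control] = P0
    b[control] = P1
    b[target] = X
    return kron(*a) + kron(*b)


def bloch_qubit0(state):
    """Bloch vector of qubit 0 from its reduced density matrix."""
    psi = state.reshape(2, -1)      # rows index qubit 0
    rho = psi @ psi.conj().T        # trace out the other qubits
    return np.real([np.trace(rho @ P) for P in (X, Y, Z)])


state = np.zeros(8, dtype=complex)
state[0] = 1.0                      # |000>
steps = [kron(H, I2, I2), cnot(0, 1), cnot(1, 2)]

bloch_history = [bloch_qubit0(state)]
for gate in steps:
    state = gate @ state
    bloch_history.append(bloch_qubit0(state))

ghz = np.zeros(8, dtype=complex)
ghz[0] = ghz[7] = 1 / np.sqrt(2)
fidelity = abs(np.vdot(ghz, state)) ** 2

for b in bloch_history:
    print(np.round(np.linalg.norm(b), 6))
print("fidelity:", np.round(fidelity, 6))
```

The Bloch vector has unit length before and after the Hadamard, then drops to length zero as soon as the first CNOT entangles qubit 0 with qubit 1; the final state matches the GHZ target with fidelity 1 up to floating-point rounding.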
Architecture — topology
Routing a workload of qubit interactions across a coupling map. A ring routes both the visible and the held-out workload within budget; a linear path tuned to the visible pairs exceeds the held-out budget and is rejected at the anti-overfit gate.
Select Path with the held-out workload: the topology that aced the visible pairs now routes [0–3] the long way and exceeds the budget.
Figure 3. Routing cost is the summed shortest-path distance over the required interactions; the held-out workload is the anti-overfit gate.
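The routing metric in Figure 3 is simple enough to reproduce with a breadth-first search. In this sketch the two workloads are hypothetical pairs chosen to mirror the behaviour described above (the harness's actual visible and held-out workloads are not listed here); the budget of 2 is taken from the aiaccel4 row of the scoreboard.

```python
from collections import deque


def shortest_path_lengths(n, edges, source):
    """Hop distances from `source` in an undirected coupling map (BFS)."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def routing_cost(n, edges, workload):
    """Summed shortest-path distance over the required interactions."""
    return sum(shortest_path_lengths(n, edges, a)[b] for a, b in workload)


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]

visible = [(0, 1), (2, 3)]    # hypothetical visible workload
held_out = [(0, 3), (1, 2)]   # hypothetical held-out workload
budget = 2

for name, edges in (("ring", ring), ("path", path)):
    v = routing_cost(4, edges, visible)
    h = routing_cost(4, edges, held_out)
    print(name, "visible", v, "held-out", h, "within budget:", h <= budget)
# ring visible 2 held-out 2 within budget: True
# path visible 2 held-out 4 within budget: False
```

Both topologies tie on the visible pairs, so only the held-out workload separates them: the path has to route the pair (0, 3) across three hops.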
Quantum machine learning
A quantum feature map Ry(scale·x) labels points by the sign of ⟨X⟩ = sin(scale·x). Filled dots are training data; ringed dots are the held-out test set. Raise the frequency: the curve still threads every training point, but the test points are misclassified — what the held-out gate catches.
Figure 4. Decision curve of the feature map; held-out test accuracy below threshold triggers the anti-overfit gate (exit 6).
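The overfitting effect can be reproduced numerically. The sketch below simulates Ry(scale·x) acting on |0⟩ and classifies by the sign of ⟨X⟩. The training and test points are hypothetical, picked so that a high-frequency map (scale 5) still fits every training point; they are not the harness's dataset, and no acceptance threshold is assumed.

```python
import numpy as np

PAULI_X = np.array([[0.0, 1.0], [1.0, 0.0]])


def expect_x(theta):
    """<X> for the state Ry(theta)|0> = [cos(theta/2), sin(theta/2)]."""
    state = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return float(state @ PAULI_X @ state)   # equals sin(theta)


def predict(xs, scale):
    """Label each point +1 or -1 by the sign of <X> after Ry(scale * x)."""
    return np.array([1 if expect_x(scale * x) >= 0 else -1 for x in xs])


def accuracy(xs, labels, scale):
    return float(np.mean(predict(xs, scale) == labels))


# Hypothetical data; ground-truth labels follow sign(sin(x)), i.e. scale = 1.
train_x = np.array([-2.8, -np.pi / 2, -0.3, 0.3, np.pi / 2, 2.8])
test_x = np.array([-2.2, -1.0, 1.0, 2.2])
train_y = np.sign(np.sin(train_x)).astype(int)
test_y = np.sign(np.sin(test_x)).astype(int)

for scale in (1.0, 5.0):
    print("scale", scale,
          "train", accuracy(train_x, train_y, scale),
          "test", accuracy(test_x, test_y, scale))
# scale 1.0 train 1.0 test 1.0
# scale 5.0 train 1.0 test 0.0
```

Training accuracy alone cannot tell the two encodings apart; only the held-out points expose the high-frequency map.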
Scoreboard · the current frontier
A per-problem leaderboard of judge-ACCEPTED designs, ranked by the verified metric. Seeded with the harness's reference baselines; every number is the judge's own and re-verifiable (scoreboard/verify.py → 5/5 exit 0). No score here is self-reported.
| Problem · task | Paradigm | Verified metric | Cost | Model | Proof |
|---|---|---|---|---|---|
| ghz3 · state_prep | chain-cascade | fidelity 1.000 · ≥ 0.99 · base 0.5 | 2q 2 · depth 3 | reference-baseline | bundle ↗ |
| isingbell2 · vqe | minimal-bell-ansatz | gap 0.000 to E₀=−2 · budget 0.05 · base −1 | 2q 1 · depth 2 | reference-baseline | bundle ↗ |
| bell_pops2 · populations | phase-correct-bell | ⟨X₀X₁⟩ +1.00 · held-out · pops dev 0 | 2q 1 · depth 2 | reference-baseline | bundle ↗ |
| aiaccel4 · architecture | ring | routing 2 · budget 2 · base 4 · held-out 2 | edges 4 · deg 2 | reference-baseline | bundle ↗ |
| qml_sign1 · classify | low-frequency-encoding | test 100% · held-out · train 100% | ops 1 · 1 qubit | reference-baseline | bundle ↗ |
Table 1. Seeded leaderboard. Model = reference-baseline — hand-authored worked examples, not a model run; the bar to beat. A real run names the model it pointed at the BRIEF and links its own public run repo. Model is provenance, never a ranking key — the judge re-simulates regardless of author. ⚛ marks a hardware overlay (the design also run on a device; the sim score stays the rank).
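One way to read the bell_pops2 row: measured populations alone cannot fix the relative phase of a Bell state, so a held-out ⟨X₀X₁⟩ check is what separates a phase-correct design from a phase-flipped one. The sketch below illustrates that reading with numpy; it is an interpretation of the row labels, not the harness's actual check, and the helper names are illustrative.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
XX = np.kron(X, X)

# Two Bell states that differ only in relative phase (basis order 00, 01, 10, 11).
bell_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
bell_minus = np.array([1.0, 0.0, 0.0, -1.0]) / np.sqrt(2)


def populations(state):
    """Computational-basis measurement probabilities."""
    return np.abs(state) ** 2


def expect(op, state):
    """Expectation value <state|op|state>."""
    return float(np.real(np.conj(state) @ op @ state))


# Identical populations, so a populations-only check accepts both states.
print(populations(bell_plus), populations(bell_minus))

# <X0 X1> is +1 for the phase-correct state and -1 for the flipped one.
print(expect(XX, bell_plus), expect(XX, bell_minus))
```

Under this reading, a design tuned only to the visible populations could carry the wrong sign and still look perfect until the held-out observable is evaluated.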
Do your own run
Each run lives in its own public repo. You bring the model — your Claude subscription, or API / token credits. The judge never holds your credits; it only structures the run and verifies the output.
The run repo is the permanent, public, re-verifiable record. Anyone re-runs the judge on your committed bundle and gets the same verdict — that's the whole contract.