For the first time, an AI agent designed and ran a multi-platform quantum experiment through direct hardware access.
An AI Ran Its Own Quantum Experiment on Real Hardware
Claude designed circuits, submitted them to three quantum backends, analyzed errors, and iterated — no human code required
Most AI-for-quantum papers describe AI writing code that humans then run. This experiment is different: Claude designed, submitted, and analyzed a quantum experiment by calling hardware directly — through Anthropic's Model Context Protocol (MCP).
The AI agent had access to three tools:
- `qi_run_local` — a local quantum emulator (qxelarator)
- `qi_submit_circuit` — QuTech's Tuna-9 superconducting transmon processor (9 qubits)
- `ibm_submit_circuit` — IBM's Torino superconducting processor (133 qubits)
No Python scripts were written. No files were saved to disk. Every circuit was designed in the conversation and submitted through tool calls. The AI chose what to measure, interpreted the results, and designed follow-up experiments based on what it found.
The Experiment: Bell State Tomography
The AI designed a two-part experiment to characterize entanglement quality across platforms:
Part 1 — State Tomography: Prepare a Bell state (|00〉 + |11〉)/√2, then measure in three bases (Z, X, Y) to reconstruct the quantum state and compute fidelity.
Part 2 — Depth Scaling: Based on findings from Part 1, the AI designed follow-up circuits with increasing numbers of CNOT gates (1, 3, 7, 15) to map how quickly each backend loses coherence.
In total, the AI submitted 17 circuits across 3 backends — 3 emulator tests, 6 Tuna-9 hardware jobs, and 8 IBM Torino jobs.
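The Part 1 tomography circuits share a simple shape: Bell-state preparation followed by a pre-measurement basis rotation. A minimal sketch that builds them as cQASM-style strings — the exact syntax accepted by the MCP tools is my assumption; only the prepare-rotate-measure structure is taken from the experiment:

```python
# Sketch: the three Bell-tomography circuits as cQASM-style strings.
# Gate names and measure syntax are illustrative, not the verified dialect.

BELL_PREP = ["H q[0]", "CNOT q[0], q[1]"]

# Pre-measurement rotations that map each basis onto the computational (Z) basis.
BASIS_ROTATIONS = {
    "Z": [],                                              # measure directly
    "X": ["H q[0]", "H q[1]"],                            # X -> Z
    "Y": ["Sdag q[0]", "H q[0]", "Sdag q[1]", "H q[1]"],  # Y -> Z
}

def tomography_circuit(basis: str) -> str:
    """Bell-state preparation followed by a change of measurement basis."""
    lines = ["version 3.0", "qubit[2] q", "bit[2] b"]
    lines += BELL_PREP
    lines += BASIS_ROTATIONS[basis]
    lines += ["b[0] = measure q[0]", "b[1] = measure q[1]"]
    return "\n".join(lines)

circuits = {basis: tomography_circuit(basis) for basis in "ZXY"}
```

Measuring each circuit many times yields the ⟨ZZ⟩, ⟨XX⟩, and ⟨YY⟩ correlators reported below.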
Part 1 Results: Cross-Platform Bell Fidelity
| Metric | Emulator | IBM Torino | Tuna-9 |
|---|---|---|---|
| 〈ZZ〉 | +1.000 | +0.961 | +0.745 |
| 〈XX〉 | +1.000 | +0.968 | +0.756 |
| 〈YY〉 | −1.000 | −0.953 | −0.734 |
| Bell Fidelity | 1.000 | 0.970 | 0.809 |
IBM Torino achieves 97% Bell state fidelity; Tuna-9 reaches 81%. Both are genuinely entangled (fidelity > 0.5, the threshold no classically correlated state can exceed), but the gap is significant. The emulator's perfect 100% confirms the circuits themselves are correct: every deviation comes from the hardware.
The noise signature is similar on both platforms: all three correlations (ZZ, XX, YY) degrade proportionally, consistent with depolarizing noise. On Tuna-9, the XX correlation is slightly better preserved than ZZ or YY, hinting at a T1 (energy relaxation) component.
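For the target state (|00〉 + |11〉)/√2, fidelity follows directly from the three correlators as F = (1 + ⟨ZZ⟩ + ⟨XX⟩ − ⟨YY⟩)/4. A minimal sketch of that arithmetic, checked against the table above (the helper names are mine, not from the experiment code):

```python
def correlator(counts: dict[str, int]) -> float:
    """Two-qubit parity expectation value from measurement counts:
    even-parity bitstrings contribute +1, odd-parity -1."""
    total = sum(counts.values())
    return sum(((-1) ** s.count("1")) * n for s, n in counts.items()) / total

def bell_fidelity(zz: float, xx: float, yy: float) -> float:
    """Fidelity with (|00> + |11>)/sqrt(2), from the three correlators."""
    return (1 + zz + xx - yy) / 4

f_ibm = bell_fidelity(0.961, 0.968, -0.953)   # ~0.970
f_tuna = bell_fidelity(0.745, 0.756, -0.734)  # ~0.809
```

The sign convention matters: a perfect Bell state has ⟨ZZ⟩ = ⟨XX⟩ = +1 but ⟨YY⟩ = −1, which is why the YY term enters with a minus sign.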
Part 2: The AI Designs a Follow-Up
Here's where the "AI experimentalist" idea gets interesting. After observing depolarizing noise on both backends, the AI reasoned:
> Since both backends show depolarizing noise, the practical question is: how fast does fidelity decay with circuit depth? I'll insert identity operations (CNOT-CNOT pairs) to increase depth without changing the intended output, then measure how quickly each backend degrades.
This is a standard technique in quantum characterization, but the AI arrived at it through reasoning about its own results — not from a script or instruction.
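The construction is simple: an odd total number of CNOTs equals one effective CNOT plus identity pairs, so the ideal output stays a Bell state while the physical depth grows. A sketch of how such circuits might be generated (gate-list representation is mine):

```python
def depth_scaled_bell(n_cnots: int) -> list[str]:
    """Bell circuit padded with redundant CNOTs: n_cnots must be odd,
    so the net effect is a single entangling CNOT plus identity pairs."""
    assert n_cnots % 2 == 1, "even counts would undo the entanglement"
    return ["H q[0]"] + ["CNOT q[0], q[1]"] * n_cnots

# The four depths the AI submitted: 1, 3, 7, 15 CNOTs.
sweep = {n: depth_scaled_bell(n) for n in (1, 3, 7, 15)}
```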
The Transpiler Discovery
The first attempt on IBM revealed something unexpected. The AI submitted circuits with 3, 7, and 15 CNOTs, but IBM's Qiskit transpiler at optimization level 3 recognized the CNOT-CNOT identity pairs and canceled them. All three circuits compiled down to the same depth-7 circuit containing a single CZ gate.
The AI adapted: it resubmitted with barriers between gates (to prevent optimization) and with optimization level 0. Now the circuits compiled to depths 8, 22, 50, and 106 — properly scaling.
This wasn't a planned finding: the transpiler's optimization behavior was discovered empirically by the AI mid-experiment.
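A toy peephole pass illustrates the effect — this is an illustrative model, not Qiskit's actual optimizer: adjacent identical CNOTs cancel to the identity, and a barrier blocks the cancellation, which is why the resubmitted circuits kept their depth.

```python
def cancel_pairs(ops: list[str]) -> list[str]:
    """Toy gate-cancellation pass: remove adjacent identical CNOTs.
    Any other op (including 'barrier') interrupts the pairing."""
    out: list[str] = []
    for op in ops:
        if out and out[-1] == op and op.startswith("CNOT"):
            out.pop()            # CNOT followed by the same CNOT == identity
        else:
            out.append(op)
    return out

padded = ["H q[0]"] + ["CNOT q[0], q[1]"] * 7
blocked = ["H q[0]"] + ["CNOT q[0], q[1]", "barrier"] * 7

# padded collapses to the bare Bell circuit; blocked keeps all 7 CNOTs.
```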
Part 2 Results: Depth Scaling
| CNOTs | Emulator | IBM (opt=3) | IBM (opt=0) | Tuna-9 |
|---|---|---|---|---|
| 1 | 1.000 | 0.980 | 0.862 | 0.873 |
| 3 | 1.000 | (optimized away) | 0.864 | 0.874 |
| 7 | 1.000 | (optimized away) | 0.877 | 0.793 |
| 15 | 1.000 | (optimized away) | 0.854 | 0.619 |
The results reveal three distinct regimes:
- Tuna-9 degrades visibly — 2.4% fidelity loss per CNOT, dropping from 87% to 62% at 15 CNOTs. The "half-life" is roughly 28 CNOTs; beyond that, fidelity falls below 50% and the output is dominated by noise.
- IBM (unoptimized) barely degrades — 0.07% per CNOT, 34x better gate fidelity than Tuna-9. But the unoptimized baseline (86%) is much worse than the optimized one (98%).
- IBM's transpiler is as valuable as its hardware — optimization level 3 provides a 12 percentage point fidelity improvement. The software stack matters as much as the quantum processor.
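The per-CNOT numbers above follow from a simple exponential model, F(n) ≈ F₀·pⁿ, fit here to the endpoints of each column — a two-point sketch assuming uniform depolarization per CNOT, not the experiment's exact fitting procedure:

```python
import math

def per_cnot_survival(f_a: float, n_a: int, f_b: float, n_b: int) -> float:
    """Two-point fit of F(n) = F0 * p**n; returns the survival factor p."""
    return (f_b / f_a) ** (1 / (n_b - n_a))

p_tuna = per_cnot_survival(0.873, 1, 0.619, 15)  # ~0.976 => ~2.4% loss per CNOT
p_ibm = per_cnot_survival(0.862, 1, 0.854, 15)   # ~0.9993 => ~0.07% loss per CNOT

# Depth at which fidelity halves, from p**n = 0.5:
half_life_tuna = math.log(0.5) / math.log(p_tuna)  # ~28 CNOTs
```

The ratio of the two loss rates (2.4% / 0.07%) is where the "34x better gate fidelity" figure comes from.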
T1 Decay on Tuna-9
At 15 CNOTs, Tuna-9's output distribution tells a physical story:
| State | Probability |
|---|---|
| |00〉 | 52.8% |
| |01〉 | 29.1% |
| |10〉 | 9.0% |
| |11〉 | 9.1% |
The state is collapsing toward |00〉. Marginalized over qubit 0, qubit 1 sits in |0〉 about 82% of the time (versus 62% for qubit 0). This is the signature of T1 energy relaxation: the qubit loses its excitation and decays toward the ground state. Qubit 1 on Tuna-9 relaxes faster than qubit 0.
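The 82% figure is just a marginal of the table above. A sketch of that sum, assuming bitstrings are written with qubit 1 as the leftmost character (the ordering that reproduces the post's numbers):

```python
def marginal_p0(probs: dict[str, float], index: int) -> float:
    """Probability that the bit at string position `index` is 0."""
    return sum(p for state, p in probs.items() if state[index] == "0")

# Measured distribution at 15 CNOTs on Tuna-9; leftmost bit = qubit 1
# (an assumption chosen to match the 82% marginal quoted in the text).
tuna_15 = {"00": 0.528, "01": 0.291, "10": 0.090, "11": 0.091}

p0_q1 = marginal_p0(tuna_15, 0)  # ~0.82: qubit 1 has mostly relaxed to |0>
p0_q0 = marginal_p0(tuna_15, 1)  # ~0.62: qubit 0 retains more excitation
```

The asymmetry between the two marginals is what singles out qubit 1 as the faster-relaxing qubit.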
What Makes This Different
This experiment wasn't remarkable for its physics — Bell state tomography and depth scaling are standard characterization techniques. What's new is the workflow:
- Zero code files. Every circuit was designed in conversation and submitted through MCP tool calls.
- Iterative reasoning. The depth-scaling experiment was designed in response to tomography results. The transpiler discovery was handled on the fly.
- Cross-platform in one session. The same AI compared three backends simultaneously, something that normally requires separate scripts, accounts, and analysis pipelines.
- Real hardware. These are production quantum processors — IBM's 133-qubit Torino and QuTech's 9-qubit Tuna-9 superconducting transmon device.
The closest analogy is Ginkgo Bioworks' autonomous protein experiments with GPT-5 — but for quantum circuits instead of wet labs. The quantum domain is actually easier to automate: the entire experimental loop is digital, circuits execute in seconds, and results are immediately interpretable.
Implications
For quantum computing: AI-driven characterization could become standard practice. Instead of running fixed benchmark suites, an AI agent can adaptively probe a quantum processor, designing each measurement based on previous results. This is more efficient than exhaustive benchmarking and can discover unexpected failure modes (like the qubit-asymmetric T1 decay we found on Tuna-9).
For AI: Tool use on real scientific instruments is a fundamentally different capability from tool use on APIs. The AI must reason about physical systems, handle noisy data, and adapt experimental design — skills that don't emerge from text generation alone.
For both: The finding that IBM's transpiler contributes nearly as much fidelity as its hardware suggests that the software-hardware co-design space is where the real optimization lies. An AI agent that can navigate both the circuit design and the compilation strategy simultaneously has an advantage over tools that treat them separately.
Reproducibility
Every measurement in this post was taken on February 10, 2026, using real quantum hardware. The complete raw data — all measurement counts, job IDs, and analysis — is stored at experiments/results/bell-tomography-cross-platform.json. The MCP server code is open source in the same repository.
Job IDs for independent verification: Tuna-9 tomography (415235, 415236, 415237), Tuna-9 depth scaling (415240, 415241, 415242), IBM tomography (d65mao0qbmes739d39f0), IBM depth scaling (d65mbntbujdc73ctle10).
Sources & References
- Model Context Protocol (MCP) — https://modelcontextprotocol.io/
- Full experiment data (JSON) — https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/bell-tomography-cross-platform.json
- QI Circuits MCP server (GitHub) — https://github.com/JDerekLomas/quantuminspire/blob/main/mcp-servers/qi-circuits/qi_server.py
- IBM Quantum MCP server (GitHub) — https://github.com/JDerekLomas/quantuminspire/blob/main/mcp-servers/ibm-quantum/ibm_server.py
- Ginkgo + GPT-5 autonomous experiments — https://openai.com/index/gpt-5-lowers-protein-synthesis-cost/
- Quantum Inspire — Tuna-9 — https://www.quantum-inspire.com/