Methodology
How one human and one AI agent produced 100+ quantum experiments, 6 paper replications, and 20+ interactive tools — without writing a single line of code by hand.
The approach
This project uses quantum vibe coding — describing experiments in natural language and letting an AI coding agent (Claude Code) handle the translation to Qiskit/cQASM, hardware submission via MCP servers, result retrieval, and statistical analysis.
The human role is not to write code. It is to direct attention, ask skeptical questions, and frame what matters. The AI handles implementation, but the human decides what to investigate, when to doubt results, and how to communicate findings.
Below are 83 representative prompts from the 349 total, organized by the 5 workflow phases that emerged naturally over 445 sessions. These are not curated marketing examples — they are the real, unpolished prompts that produced the research. The messy ones are often the most important.
The 5-phase workflow
The phases are sequential in a project lifecycle but cyclical in practice — you return to skepticism (Phase 3) constantly, and meaning-making (Phase 4) feeds back into new experiments (Phase 2).
Phase 1: Exploration & setup
Open-ended questions that orient the agent to the domain and get infrastructure running. ~70 prompts in this phase across 445 sessions.
“how might i demonstrate the capacity of claude code on quantum computing? Is there an existing benchmark? Could I create one?”
This prompt started the entire project. Led to discovering Qiskit HumanEval (151 tasks).
“can you look for skills and dev setup for programming quantum computers including at tu delft quantum?”
Prompted discovery of MCP servers, QI SDK, and the Python 3.12 requirement.
“Is there a quantum random number generator based on a real quantum computer accessible via mcp?”
Led to building the QRNG MCP server (ANU → Tuna-9 → local emulator fallback chain).
“get me on github and start organizing this project as an exploration of accelerating science with generative ai, in the field of quantum inspire at TU Delft.”
Set the research framing that carried through the entire project.
“Can you search for recent graduations at qtech at tu delft with leiven vandersplein and other quantum researchers? That will give us a sense for what research questions they value”
Literature grounding — connected our work to active research directions at QuTech.
“python 3.14 is breaking everything. libqasm wont install. can we pin to 3.12?”
Python 3.14 breaks libqasm and sklearn. Pinning to 3.12 in a venv fixed all dependency issues.
“ok so we have QI. What about IBM? And there was that IQM thing... can we get on all three?”
Led to setting up IBM Quantum (Torino, 133q) and IQM Garnet (20q). Three chips, one codebase.
“build me an MCP server that wraps the QI SDK so claude code can submit circuits directly as a tool call”
Led to qi-circuits MCP server. Later built ibm-quantum and qrng servers the same way.
“what papers should we try to replicate? pick ones that used 2-9 qubits on real hardware and have enough detail to actually reproduce”
Agent selected 6 papers spanning 2014-2023. Selection criteria became part of the methodology.
“ok wait, before we do anything else — what does the landscape of AI for quantum actually look like? who else is doing this?”
Led to the adaptive experimentation literature survey. Found the experiment design gap that became our contribution.
“can you set up a daemon that runs experiments automatically? like, queue them up, submit to hardware, poll for results, analyze, store, repeat”
Led to experiment_daemon.py — 1400+ lines, the backbone of autonomous experimentation.
“I want a website. Dark theme, monospace, the whole hacker aesthetic. Show the real data live.”
Led to Next.js 14 + Tailwind site. Every number on the site is computed from real experiment data.
“what is a Hamiltonian? like really, explain it to me. And what is VQE actually doing?”
The human not knowing quantum physics turned out to be a feature, not a bug. Questions became content.
“can we use PennyLane to compute the exact ground state energy so we have a reference to compare hardware against?”
Established FCI reference values. Without this, we couldn't quantify hardware error in kcal/mol.
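The project used PennyLane for its reference values; the underlying idea is just exact diagonalization, which can be sketched with plain NumPy. The coefficients below are illustrative placeholders, not the project's actual H2 values:

```python
import numpy as np

# Single-qubit Pauli matrices
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def two_qubit_hamiltonian(g):
    """H = g0*II + g1*ZI + g2*IZ + g3*ZZ + g4*XX + g5*YY (illustrative form)."""
    terms = [np.kron(I2, I2), np.kron(Z, I2), np.kron(I2, Z),
             np.kron(Z, Z), np.kron(X, X), np.kron(Y, Y)]
    return sum(c * t for c, t in zip(g, terms))

def exact_ground_energy(g):
    """Lowest eigenvalue of H: the exact (FCI) ground-state energy."""
    return np.linalg.eigvalsh(two_qubit_hamiltonian(g)).min()

# Illustrative coefficients only, NOT the project's tabulated values
g = [-0.35, 0.39, -0.39, -0.01, 0.18, 0.18]
E0 = exact_ground_energy(g)
```

With this exact E0 in hand, any hardware VQE result can be converted to an error in kcal/mol by differencing against it.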
“set up a replication agent that can take a paper, extract the claims, and systematically test each one”
Led to replication_agent.py + replication_analyzer.py — structured claim-by-claim testing.
Phase 2: Running experiments
Directing experiments with a consistent pattern: start on emulator, validate, then move to hardware. ~120 prompts in this phase — the bulk of the work.
“what would be most impressive to add? e.g., if we had experiments that were running continuously and queuing up to use the qi hardware and outputting real data? That would "show" the ai science.”
Led to building the experiment daemon (auto-queue, auto-submit, auto-analyze).
“I think there is probably a workflow where we first evaluate in simulation and then move to real hardware and validate...”
Established the emulator-first validation pattern that caught most bugs before burning hardware credits.
“Queue them up and start gathering data. Be sure every experiment begins with a clear research question and purpose. After every experiment, reflect and adjust the queue to learn the most”
Turned the agent from a tool-user into a scientist — adaptive experimentation.
“save that reflection in md. then, what do you think is the next experiment you'd like to run? Other hardware? Or something else?”
Asking the agent to propose next steps produced better experiment design than prescribing them.
“Should we replicate our replications so we can see how reliable they are? Do we save the code along with the data?”
Led to the reproducibility infrastructure (SHA256 checksums, environment snapshots, variance analysis).
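The checksum half of that infrastructure is simple to sketch. This is a minimal illustration with a hypothetical function name, assuming results are stored as JSON-serializable dicts:

```python
import hashlib
import json

def checksum_result(result: dict) -> str:
    """SHA256 over a canonical JSON serialization of an experiment result.

    Sorting keys and fixing separators makes the hash stable across runs
    and across dict insertion orders, so any later edit to a stored
    result file is detectable.
    """
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

result = {"experiment": "vqe_h2", "backend": "tuna9", "energy": -1.1362}
digest = checksum_result(result)
```

The design choice that matters is canonicalization: hashing the raw file bytes would break on harmless reformatting, while hashing canonical JSON only changes when the data does.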
“run a Bell state on every qubit pair on Tuna-9. I want to know which connections actually work and how good they are.”
Topology characterization: 12/36 pairs connected, 85.8-93.5% fidelity. Some "connected" pairs in docs were actually dead.
“the topology data we had from last week is wrong. q6-8 are alive now. I think the hardware got recalibrated. characterize it fresh.”
Stale data detection. The AI starting from zero was an advantage — it didn't carry stale assumptions.
“run the same VQE at different bond distances. 0.5, 0.735, 1.0, 1.5, 2.0, 2.5 angstroms. I want to see the whole potential energy surface.”
PES sweep on Tuna-9. Found error minimum at R=1.0, NOT at equilibrium. Led to competing noise regimes discovery.
“now run it on IBM with TREX. resilience_level=1, nothing else. keep it simple.”
TREX alone: 0.22 kcal/mol. Chemical accuracy on the first try. The simplest advanced option was the best.
“ok now add dynamical decoupling on top of TREX. then add twirling. then try ZNE. I want the full mitigation ladder.”
More mitigation made things WORSE. TREX+DD: 1.33. TREX+DD+Twirl: 10.0. ZNE: 12.84. Major finding.
“run HeH+ the same way we ran H2. Same circuit shape, same shots, same everything. Just different Hamiltonian.”
Controlled comparison. HeH+ gave 91.2 kcal/mol vs H2's 1.66 kcal/mol. Same circuit, 55x worse. Why?
“can we test if it's the CNOT noise that's killing us? add extra CNOTs (gate folding) and see if the energy gets worse”
ZNE gate folding experiment: 12 runs, 3 noise levels. Extra CNOTs added <1.3 kcal/mol. Gate noise is NOT the bottleneck.
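Gate folding works because CNOT is its own inverse: replacing one CNOT with three (or five, seven, ...) leaves the ideal circuit unchanged while giving physical noise more chances to act. A tiny NumPy check of that identity, as a sketch rather than the experiment's actual code:

```python
import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def fold(gate, k):
    """Replace `gate` with gate @ (gate @ gate)^k, i.e. 2k+1 copies.

    On ideal hardware this is an identity rewrite; on real hardware each
    extra pair contributes its own physical noise, amplifying the
    effective gate-noise level by roughly (2k+1)x.
    """
    folded = gate.copy()
    for _ in range(k):
        folded = folded @ gate @ gate
    return folded

# 3x and 5x folded CNOTs are exactly one CNOT in the noiseless case
assert np.allclose(fold(CNOT, 1), CNOT)
assert np.allclose(fold(CNOT, 2), CNOT)
```

If folding barely moves the measured energy, as happened here, gate noise cannot be the dominant error source.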
“if gate noise isnt the problem, its readout. can we calibrate the readout error? prepare |00> and |11> and see what comes back.”
Readout calibration: q2 has 9.2% error (asymmetric: 8.5% in one direction, 0.7% the other). This explained >80% of total error.
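The calibration-and-correction step can be sketched with NumPy: prepare |0> and |1>, build a confusion matrix from the counts, and invert it. The counts below are illustrative, chosen only to mirror the kind of asymmetry described above, not the actual q2 calibration:

```python
import numpy as np

def confusion_matrix(counts0, counts1):
    """Column j = measured distribution when |j> was prepared.

    counts0/counts1: count dicts from preparing |0> and |1> respectively.
    """
    n0 = counts0["0"] + counts0["1"]
    n1 = counts1["0"] + counts1["1"]
    return np.array([[counts0["0"] / n0, counts1["0"] / n1],
                     [counts0["1"] / n0, counts1["1"] / n1]])

def apply_rem(probs, M):
    """Readout-error mitigation: undo the confusion matrix by inversion."""
    return np.linalg.inv(M) @ probs

# Illustrative asymmetric readout: |1> misread as "0" far more often
M = confusion_matrix({"0": 993, "1": 7},    # prepared |0>
                     {"0": 85, "1": 915})   # prepared |1>
p_true = np.array([0.5, 0.5])
p_meas = M @ p_true            # what the noisy readout would report
p_corr = apply_rem(p_meas, M)  # recovers p_true up to float error
```

One caveat that matters later: on real data the inversion can push probabilities slightly negative or redistribute weight into wrong-parity states, which is exactly the failure mode the hybrid post-selection-plus-REM pipeline was built to catch.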
“now apply the confusion matrix correction to all our old VQE results. retroactively. see if they get better.”
Offline REM reanalysis: 21 experiments improved. Mean error dropped from 8.30 to 2.52 kcal/mol. 70% improvement for free.
“try combining post-selection with REM. post-select first then REM. then REM first then post-select. does the order matter?”
Order matters hugely. REM+PS (REM first): 2.52 kcal/mol mean. PS+REM: 3.90. Full REM alone: catastrophic on noisy runs (39 kcal/mol).
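The post-selection half of that pipeline is a parity filter applied before any matrix correction. A minimal sketch, with a hypothetical function name and illustrative counts:

```python
def post_select(counts, parity=0):
    """Keep only bitstrings whose bit-parity matches the expected sector.

    Returns the filtered counts and the kept-shot fraction. A low keep
    fraction is itself diagnostic: it means gate noise is leaking
    population out of the symmetry sector.
    """
    kept = {s: n for s, n in counts.items()
            if sum(int(b) for b in s) % 2 == parity}
    total = sum(counts.values())
    keep_fraction = sum(kept.values()) / total if total else 0.0
    return kept, keep_fraction

# Illustrative counts: "01"/"10" are odd-parity leakage in an even sector
counts = {"00": 480, "11": 460, "01": 35, "10": 25}
kept, frac = post_select(counts, parity=0)
```

Running this filter first, then REM on the surviving counts, is the ordering that won in the comparison above.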
“run QAOA for MaxCut on Tuna-9. 4-node path graph. sweep gamma and beta on a 5x5 grid.”
QAOA MaxCut: 74.1% approximation ratio (5x5 sweep) vs 53.5% (single point). Parameter landscape matters.
“can we do Quantum Volume? run the standard QV protocol on Tuna-9 and see what we get.”
QV=16 on Tuna-9 (4-qubit circuits pass, 8/10 above 2/3 threshold). Compared to QV=32 on IBM and IQM.
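The pass/fail logic in the QV protocol is worth making concrete: the heavy outputs of a circuit are the bitstrings whose ideal probability exceeds the median, and a circuit passes when the measured heavy-output probability clears 2/3. A sketch under illustrative numbers (real QV uses random SU(4) model circuits, which this omits):

```python
import statistics

def heavy_outputs(ideal_probs):
    """Bitstrings whose ideal probability exceeds the median probability."""
    med = statistics.median(ideal_probs.values())
    return {s for s, p in ideal_probs.items() if p > med}

def heavy_output_probability(ideal_probs, counts):
    """Fraction of measured shots that landed in the heavy set."""
    heavy = heavy_outputs(ideal_probs)
    total = sum(counts.values())
    return sum(n for s, n in counts.items() if s in heavy) / total

# Illustrative 2-qubit ideal distribution and hardware counts
ideal = {"00": 0.45, "01": 0.35, "10": 0.15, "11": 0.05}
counts = {"00": 400, "01": 310, "10": 180, "11": 110}
hop = heavy_output_probability(ideal, counts)
passed = hop > 2 / 3
```

QV=16 then means: 4-qubit model circuits pass this test with statistical confidence, 5-qubit ones do not.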
“replicate the kicked Ising model from Kim 2023. they used 127 qubits but we can do 9. does ZNE still help at our scale?”
9-qubit kicked Ising: 180 CZ gates, ZNE gives 14.1x improvement on emulator, 2.3x on Tuna-9 hardware.
“run the H2 VQE on IQM Garnet too. same circuit, same parameters. I want three hardware points for every experiment.”
Cross-platform consistency: IBM and Tuna-9 gave nearly identical HeH+ errors (4.45 vs 4.44 kcal/mol). The coefficient ratio sets the floor, not the hardware.
“ok the daemon is on the VPS. set it up as a systemd service that auto-restarts and auto-pushes results to git.”
Daemon runs on clawdbot VPS, auto-commits results. Experiments run 24/7 without human presence.
Phase 3: Critical review & debugging
The most valuable prompts were skeptical ones. Every major discovery came from asking "is this actually right?" ~80 prompts in this phase.
“Act like a skeptical reviewer... can you poke holes in this? How do we know it's not AI hallucination? Trivial?”
Found that the CNOT gate implementation in our math library was broken. Bell/GHZ states were all wrong.
“can you act like a critical reviewer and look through the site and the experiments and results and try to poke holes and find inconsistencies, misconceptions, inaccuracies and other problems?”
Caught energy unit inconsistencies, stale backend names, and analysis pipeline bugs.
“yeah, actually AI did it all. I'm just prompting here. but let's fix the energy bug? Really, you didn't find any LLM bs faked data or anything?”
Honest acknowledgment that the human is directing, not coding. Important for framing.
“wait, it was that easy? Are you sure that is real?”
Asked after QEC detection code "worked" on first try. Turned out the codespace prep was missing — XXXX was giving random 50/50.
“The interesting cases are the failures. Why did Peruzzo give 83.5 kcal/mol on IBM? That's more publishable than the successes.”
Reframing failures as findings. Led to the coefficient amplification discovery (|g1|/|g4| ratio predicts error).
“the VQE energy is off by 1400 kcal/mol. that cant be a measurement error. something is fundamentally wrong with the circuit.”
X gate was on the wrong qubit. For HeH+ (g1<0, g2>0), HF has q1=1, so X goes on q1. Getting this wrong gives catastrophic error.
“are we using the right Hamiltonian coefficients? I see BK-tapered and sector-projected in the literature and they give different signs for g4 and g5.”
Critical distinction. BK-tapered has g4=g5=-0.0905 (negative theta), sector-projected has +0.0905 (positive theta). Both valid but not interchangeable.
“we said the IBM result was 9.2 kcal/mol for weeks. then we applied post-selection offline and it's actually 1.66. which number is real? why didn't the original analysis catch this?”
The stored analysis was wrong, not the data. Post-selection should have been applied from the start. Led to reanalysis of all 21 Tuna-9 results.
“the bitstring from Tuna-9 is "010110100" but which qubit is which? is it MSB first or LSB first? because if we get this wrong all our analysis is garbage.”
MSB-first: b[N-1]...b[0]. IBM is the same. Getting this wrong silently produces wrong energies that look plausible.
“the non-contiguous qubit pairs are giving nonsense results. q[2,4] works but q[4,6] gives random data. is the extraction code correct?”
Bug: bits[-2:] only works for q[0,1]. Non-contiguous pairs need explicit index extraction. Fixed with _extract_qubit_bits() helper.
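The convention and the fix fit in one helper: with MSB-first strings, the bit for qubit k sits at index `len(s) - 1 - k`. A sketch in the spirit of that `_extract_qubit_bits()` helper (the signature here is hypothetical):

```python
def extract_qubit_bits(bitstring, qubits):
    """Read specific qubits out of an MSB-first readout string.

    MSB-first means the string is b[N-1]...b[0], so qubit k lives at
    index len(bitstring) - 1 - k. The buggy version sliced bits[-2:],
    which happens to be correct only for the contiguous pair q[0, 1].
    """
    return "".join(bitstring[len(bitstring) - 1 - q] for q in qubits)

s = "010110100"  # 9-qubit readout; q0 is the rightmost character
```

Because a wrong ordering still yields well-formed bitstrings, the resulting energies look plausible, which is what makes this class of bug dangerous.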
“why is ZNE making things worse on both platforms? the textbook says it should help.”
ZNE amplifies gate noise and extrapolates to zero. But when readout error is 80%+ of total error, there's nothing useful to extrapolate. Flat/non-monotonic trend confirmed.
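The mechanics make the failure mode obvious: ZNE measures at amplified noise levels, fits a trend, and reads off the intercept at zero noise. When the trend is flat, the intercept is just the unamplified value. A sketch with illustrative energies (not measured data):

```python
import numpy as np

def zne_linear(noise_factors, energies):
    """Linear zero-noise extrapolation: fit E(lambda), return E(0)."""
    slope, intercept = np.polyfit(noise_factors, energies, 1)
    return intercept

factors = [1, 3, 5]  # CNOT folding multipliers

# Gate-noise-dominated case: energy degrades with folding, so the
# extrapolation recovers something below the unmitigated E(1)
E_gate = [-1.10, -1.02, -0.94]
zne_gate = zne_linear(factors, E_gate)

# Readout-dominated case: folding barely changes the energy, so the
# extrapolated value is essentially the unmitigated one
E_flat = [-1.05, -1.05, -1.05]
zne_flat = zne_linear(factors, E_flat)
```

Worse than useless, in fact: with a flat or non-monotonic trend the fit amplifies shot noise in the intercept, which is how ZNE ends up degrading the result.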
“the confusion matrix correction made one run go from 8 to 39 kcal/mol. how is correction making it WORSE?”
Full REM redistributes probability to all states including wrong-parity ones. When gate noise is high, this amplifies errors. Hybrid PS+REM catches it.
“these results are suspiciously good. 0.22 kcal/mol on real hardware? that's better than the paper. double check everything.”
It was real. TREX is genuinely that effective for shallow circuits with readout-dominated noise. The key insight: match mitigation to noise type.
“the QEC code says it's detecting errors but the detection rate is 50/50 for Z errors. that's just random. what's going on?”
Z errors don't flip data bits, so Z-basis measurement can't localize them. Fundamental limitation, not a bug. Led to understanding NN decoder vs lookup table.
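The limitation is easy to verify directly: a Z error changes only the relative phase of a state, so the computational-basis probabilities, which are all a Z-basis measurement can see, are untouched. A minimal NumPy check:

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

def z_basis_probs(state):
    """Measurement probabilities in the computational (Z) basis."""
    return np.abs(state) ** 2

psi = np.array([0.6, 0.8], dtype=complex)  # an arbitrary qubit state

# A Z error changes only the phase: Z-basis statistics are identical
assert np.allclose(z_basis_probs(Z @ psi), z_basis_probs(psi))

# An X error flips populations: a Z-basis measurement can see it
assert not np.allclose(z_basis_probs(X @ psi), z_basis_probs(psi))
```

Detecting Z errors requires measuring X-type stabilizers, which is exactly where the decoder discussion picks up.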
“we need the [[4,2,2]] code but each ancilla has to CNOT all 4 data qubits. Tuna-9's max degree is 3. this literally can't work.”
Original characterization found max degree=3. Later discovered full 12-edge topology where q4 has degree 4 — enabling [[4,2,2]] with q4 as sole ancilla. Achieved 66.6% detection rate (10 CNOTs via q4 bus). IBM Torino still better at 92.7% (6 CNOTs, richer connectivity).
“HeH+ has the same circuit as H2 but 20x worse error. same hardware, same shots, same qubits. what's different?”
Coefficient amplification. |g1|/|g4| ratio: H2=4.4, HeH+=7.8. 1.8x ratio increase → 20x error increase. The Hamiltonian structure, not the hardware, sets the error floor.
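The amplification mechanism follows from the energy being a weighted sum, E = sum_i g_i * <P_i>: the same measurement bias delta on <Z0> contributes |g1| * delta to the energy error, so a Hamiltonian with a larger Z coefficient amplifies identical hardware noise. A sketch with illustrative coefficients (chosen to echo the H2-vs-HeH+ contrast, not the project's tabulated values):

```python
def energy(coeffs, expvals):
    """E = sum_i g_i * <P_i> for a Pauli-term Hamiltonian."""
    return sum(g * e for g, e in zip(coeffs, expvals))

def error_from_bias(coeffs, expvals, term, delta):
    """Energy shift when one measured expectation picks up bias `delta`."""
    biased = list(expvals)
    biased[term] += delta
    return abs(energy(coeffs, biased) - energy(coeffs, expvals))

# Two Hamiltonians differing mainly in the size of the Z coefficient
# relative to the entangling XX/YY terms (illustrative numbers only)
expvals = [1.0, -0.9, 0.9, -0.8, 0.3, 0.3]
g_small = [-0.35, 0.40, -0.40, -0.01, 0.18, 0.18]  # H2-like ratio
g_large = [-1.90, 1.55, -1.55, -0.01, 0.20, 0.20]  # HeH+-like ratio

delta = 0.05  # the same hardware bias in both cases
err_small = error_from_bias(g_small, expvals, 1, delta)  # |g1| * delta
err_large = error_from_bias(g_large, expvals, 1, delta)
```

Same bias, same circuit, several times the energy error: the Hamiltonian's coefficient structure, not the hardware, sets the floor.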
“our IBM result was 91.2 kcal/mol for HeH+. then we switched from SamplerV2+PS to EstimatorV2+TREX and got 4.45. which analysis pipeline was broken?”
SamplerV2+PS was applying post-selection to already-biased data without readout correction. EstimatorV2+TREX handles readout internally. The first pipeline was producing garbage.
“post-selection keeps 98% for H2 but only 84% for HeH+. 16% of HeH+ shots are leaking out of the parity sector. why?”
HeH+ ansatz state requires more entanglement (optimal alpha further from reference). Higher entanglement = more gate noise = more parity leakage.
Phase 4: Meaning-making & communication
Turning raw results into understanding. Visualization, sonification, narrative framing. ~50 prompts in this phase.
“coherence... with what? and the microwave frequency, how is that tuned? Is it cold because that way it is in the lowest energy level? Are these energy levels like atoms?”
Genuine curiosity prompts produced the best educational content. The How Qubits Work series came from questions, not directives.
“I actually want to turn this into a resonance-based explanation of how quantum computers work. Like, we need animations of microwave pulses and the whole deal.”
Led to the /how-it-works resonance explainer — the most distinctive page on the site.
“lets make a set of smaller units that explore sonification in a different way. Can you brainstorm how we might sonify the data?”
Led to quantum circuit sonification — hearing the difference between clean emulator and noisy hardware.
“Think about it from a QDNL perspective again. I think the AI accelerated science hits. The AI as interface to quantum computing hits...”
Stepping back to check alignment with stakeholders. Kept the project grounded.
“it's like AI is the interface between humans and quantum...”
The one-sentence thesis that emerged from all of this. Sometimes the best prompt is a half-formed thought.
“build a 3D Bloch sphere in three.js. let people drag it around. show how gates rotate the state.”
Led to the interactive Bloch sphere — one of the most visited pages. Three.js + real-time state computation.
“I want a page where people can see what measurement actually does. like, show the Born rule happening in real time as they change the state.”
Led to the Measurement Lab. Interactive visualization of collapse, probability, and repeated measurement.
“can we put all the experiment results in a dashboard? grouped by type. energy diagrams for VQE. backend badges showing which chip.”
Led to /experiments page. Every datapoint is a real hardware run with JSON source data.
“make a page that compares all three chips head to head. same metrics, same layout. I want to see the tradeoffs at a glance.”
Led to /platforms. Tuna-9 beats Garnet on GHZ-5 (83.8% vs 81.8%). Topology beats scale.
“we need a blog. each discovery should be a post. write up the error mitigation showdown first — the one where we tested 15 techniques.”
Led to 14 blog posts. Each one written from real experiment data, not hypothetical examples.
“the replication results need their own page. show each paper, the claims we tested, pass/fail on each backend. make the gaps visible.”
Led to /replications dashboard. Claim-by-claim comparison across 4 backends for 6 papers.
“I think the site should have three pillars. Research is the center. Learn is the tools. VibeCoding is the method. everything flows from that.”
Information architecture that organized 34 pages into a coherent narrative.
“what does the paper outline look like? can you draft a structure? we have enough data for a real publication.”
Led to paper-outline.md — structured for Quantum Science and Technology. Abstract, methods, 8 figures, appendices.
“write a failure analysis for HeH+. not just "it failed" but why, quantitatively. error budget decomposition. term by term.”
Led to failure-analysis.md. Discovered g1*delta_Z0 predicts energy error. Made the testable prediction that IBM TREX would give 3-4x improvement but not chemical accuracy. Prediction confirmed.
“the coefficient amplification thing — can we make a plot? ratio on x axis, error on y axis, all 30+ data points, both molecules, all backends.”
Led to amplification-threshold-analysis.json. Superlinear scaling: 1.8x ratio → 20x error. Chemical accuracy threshold at ratio ~5.
“we have a glossary but it needs to be more than definitions. link each term to the experiment where we actually used it.”
Led to learn page glossary with 40+ terms, 7 categories, each connected to real experimental context.
“what would happen if someone came to this site knowing nothing? walk me through it as a quantum researcher. then as a student. then as an AI person.”
UX persona walkthroughs. Found that the three-pillar structure serves all three audiences.
Phase 5: Session management & debugging
The prompts nobody shows in demos but everyone types in real sessions. ~30 prompts in this phase.
“taking a really long time. i cant tell if you are working or not... how come? like i wish there was better status about what you are working on when it is taking so long”
Led to adding run_in_background and better status feedback patterns to CLAUDE.md.
“why would it error and not let me know? it just hung up”
Discovered pipe buffering issue. Fix: never pipe long-running commands. Now in our CLAUDE.md.
“its not about the content, its just -- 5 min for 1000 tokens? Something else is going on, maybe you can't see it. I need more visibility”
When the agent is slow, the problem is usually infrastructure, not the LLM. Diagnose the pipeline.
“you are stuck.”
Sometimes the best prompt is two words. The agent had been looping on a failed approach for several turns.
“I was working on something in this window but I can't see it anymore”
Context window management is real. Led to session handoffs and /compact discipline.
“vercel --prod is giving me a build error about deploymentProtection in vercel.json. but we need it.”
Vercel CLI rejects deploymentProtection in vercel.json. Must use project settings instead. Documented in CLAUDE.md.
“the hardware job has been PLANNED for 20 minutes. is the queue stuck? should we cancel and resubmit?”
Tuna-9 queues can stall. Sometimes resubmission works, sometimes the backend is down. Patience or pivot.
“save a handoff. I need to close this window. list what files we changed, what state we are in, what to do next.”
Led to handoff files in .claude/handoffs/. Critical for multi-day projects with context limits.
“/compact at 75% please. preserve the experiment results and the current task state.”
Discipline around context window management. Without it, the agent loses track of earlier work.
“the MCP server isn't connecting. qi_list_backends returns nothing. is it a config issue or is the service down?”
MCP config lives in ~/.claude/. Authentication tokens expire. Always check config.json first.
“the daemon committed 47 JSON files but half of them have duplicate experiment IDs. what happened?”
Race condition in auto-commit. Fixed by adding locking and deduplication to the daemon pipeline.
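The deduplication half of that fix is a one-pass filter keyed on experiment ID. A minimal sketch with hypothetical names, assuming each result payload carries an `experiment_id` field:

```python
def dedupe_results(results):
    """Keep the first result for each experiment ID; drop later duplicates.

    The daemon's race condition produced multiple JSON payloads with the
    same ID. Deduplicating before the auto-commit keeps the archive
    idempotent: re-running the commit step can never double-count a run.
    """
    seen = set()
    unique = []
    for r in results:
        exp_id = r["experiment_id"]
        if exp_id not in seen:
            seen.add(exp_id)
            unique.append(r)
    return unique

batch = [{"experiment_id": "vqe-017", "energy": -1.13},
         {"experiment_id": "vqe-018", "energy": -1.10},
         {"experiment_id": "vqe-017", "energy": -1.13}]  # duplicate
clean = dedupe_results(batch)
```

The locking half, serializing writers so duplicates stop being produced, is the real fix; the dedup pass is the safety net for anything already in flight.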
“don't pipe the build output. just run it. i'll wait. last time the pipe buffering made it look frozen for 3 minutes.”
Hard-won lesson now in CLAUDE.md: never pipe long-running commands. Output buffering kills feedback.
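In Python terms, the lesson is to stream a child process line by line instead of collecting a pipe that only flushes at the end. A sketch of that pattern, not the project's actual tooling:

```python
import subprocess
import sys

def run_streaming(cmd):
    """Run a command and surface each output line as it arrives.

    Reading line-by-line avoids the buffered-pipe failure mode where a
    long-running command looks frozen until its buffer finally flushes.
    """
    lines = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True,
                          bufsize=1) as proc:
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))  # real use: log/display here
    return proc.returncode, lines

# Demo with a short Python child process
code, out = run_streaming(
    [sys.executable, "-c", "print('step 1'); print('step 2')"])
```

For an agent, the payoff is feedback: each line arrives as it is produced, so "still working" and "hung" are distinguishable.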
“this session is getting long. can you write a memory note about what we learned about coefficient conventions so we dont have to rediscover it next time?”
Led to MEMORY.md entries that persist across sessions. The H2/HeH+ Hamiltonian coefficients are now documented permanently.
What this reveals
The human is the scientist
The AI writes all the code, runs all the circuits, and analyzes all the data. But the prompts that matter most are skeptical questions (Phase 3) and half-formed intuitions (Phase 4). The human's job is judgment, not implementation.
Curiosity beats directives
The best educational content came from genuine questions (“coherence... with what?”), not specifications. The best experiment designs came from asking the agent what it would investigate next, not prescribing the protocol.
Failures are the findings
Every major discovery — coefficient amplification, the TREX mitigation ladder, topology-vs-scale — came from investigating failures, not celebrating successes. The prompt “why did this fail?” is more productive than “make this work.”
Phase 5 is honest
Real AI-assisted work includes “you are stuck,” “why is this taking so long,” and “I can't see what you're doing.” The friction is part of the method. Acknowledging it makes the work reproducible.