QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits

arXiv cs.LG Papers

Summary

Introduces QASM-Eval, the first comprehensive dataset for training and evaluating LLMs on OpenQASM 3 programs with hardware-oriented features, including expert-verified test and training sets. Evaluation shows that fine-tuning on QASM-Eval significantly improves LLM performance, with Llama-3-70B achieving 85% pass@1, outperforming few-shot GPT-5.2.

arXiv:2605.30358v1 Announce Type: new Abstract: Quantum computing remains in the Noisy Intermediate-Scale Quantum (NISQ) era, where the performance is highly constrained to noise. Addressing the limitation often requires hardware-facing capabilities beyond gate-sequence circuit specification, including mid-circuit measurement and classical feedback for quantum error correction (QEC), precise timing control for dynamical decoupling (DD), and pulse-level waveform access for calibration. OpenQASM-3 was introduced to expose exactly these capabilities, providing a hardware-level programming interface. However, despite the rapid progress of large language models in code generation, there is still no dataset specifically designed to train and evaluate LLMs on OpenQASM-3 programs that involve its advanced hardware-oriented features. To address this gap, we introduce QASM-Eval, the first comprehensive dataset designed to train and evaluate LLMs on OpenQASM-3. Rather than focusing on quantum algorithm design or reasoning, QASM-Eval explicitly targets the language's hardware-facing features. QASM-Eval comprises an expert-verified test set of 100 tasks and a training set of 4,000 tasks, systematically covering classical logic, timing scheduling, pulse control, and complex real-world workflows. To automatically validate generated programs, we check syntax, quantum states and program timeline using an extended verifier. Our evaluation reveals that while state-of-the-art LLMs struggle heavily in OpenQASM-3 coding tasks, targeted fine-tuning on QASM-Eval yields significant gains. QASM-Eval provides a crucial benchmark and training foundation to accelerate the development of reliable LLM assistants for hardware-facing quantum programming in NISQ era. Data and code: https://github.com/fuzhenxiao/QASM-Eval
Original Article
View Cached Full Text

Cached at: 06/01/26, 09:22 AM

# QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits
Source: [https://arxiv.org/html/2605.30358](https://arxiv.org/html/2605.30358)
###### Abstract

Quantum computing remains in the Noisy Intermediate\-Scale Quantum \(NISQ\) era, where the performance is highly constrained to noise\. Addressing the limitation often requires hardware\-facing capabilities beyond gate\-sequence circuit specification, including mid\-circuit measurement and classical feedback for quantum error correction \(QEC\), precise timing control for dynamical decoupling \(DD\), and pulse\-level waveform access for calibration\. OpenQASM 3 was introduced to expose exactly these capabilities, providing a hardware\-level programming interface\. However, despite the rapid progress of large language models in code generation, there is still no dataset specifically designed to train and evaluate LLMs on OpenQASM 3 programs that involve its advanced hardware\-oriented features\.

To address this gap, we introduce QASM\-Eval, the first comprehensive dataset designed to train and evaluate LLMs on OpenQASM 3\. Rather than focusing on quantum algorithm design or reasoning, QASM\-Eval explicitly targets the language’s hardware\-facing features\. QASM\-Eval comprises an expert\-verified test set of 100 tasks and a training set of 4,000 tasks, systematically covering classical logic, timing scheduling, pulse control, and complex real\-world workflows\. To automatically validate generated programs, we check syntax, quantum states and program timeline using an extended verifier\. Our evaluation reveals that while state\-of\-the\-art LLMs struggle heavily in OpenQASM 3 coding tasks, targeted fine\-tuning on QASM\-Eval yields significant gains\. Specifically, fine\-tuning Llama\-3\-8B approaches zero\-shot GPT\-5\.2 performance, while Llama\-3\-70B achieves an 85% overall pass@1, outperforming few\-shot\-augmented GPT\-5\.2\. QASM\-Eval provides a crucial benchmark and training foundation to accelerate the development of reliable LLM assistants for hardware\-facing quantum programming in NISQ era\. Data and code:[https://github\.com/fuzhenxiao/QASM\-Eval](https://github.com/fuzhenxiao/QASM-Eval)

## 1Introduction

Quantum computing has shown advantage in domains such as chemical simulation\[[12](https://arxiv.org/html/2605.30358#bib.bib19),[32](https://arxiv.org/html/2605.30358#bib.bib20)\], optimization\[[1](https://arxiv.org/html/2605.30358#bib.bib21)\], and quantum machine learning\[[13](https://arxiv.org/html/2605.30358#bib.bib22),[7](https://arxiv.org/html/2605.30358#bib.bib23)\]\. Yet practical quantum hardware remains in the Noisy Intermediate\-Scale Quantum \(NISQ\) regime\[[15](https://arxiv.org/html/2605.30358#bib.bib24),[38](https://arxiv.org/html/2605.30358#bib.bib25)\],where quantum processors still suffer from quantum noise, a stochastic disturbance corrupting quantum states and causing computational deviations\. Although numerous approaches exist to mitigate the effects of noise, such as quantum error correction \(QEC\)\[[19](https://arxiv.org/html/2605.30358#bib.bib35)\], dynamical decoupling \(DD\)\[[28](https://arxiv.org/html/2605.30358#bib.bib36)\], and routine calibration\[[29](https://arxiv.org/html/2605.30358#bib.bib37)\], each of them requires specific, low\-level hardware controls\. First, QEC relies on mid\-circuit measurement and runtime classical computation/feedback\[[42](https://arxiv.org/html/2605.30358#bib.bib38)\]\. Second, intervention methods such as DD are highly dependent on precise control over gate timing\[[10](https://arxiv.org/html/2605.30358#bib.bib39)\]\. Third, because qubits naturally drift over time and calibrated gate fidelities keep degrading\[[9](https://arxiv.org/html/2605.30358#bib.bib45),[39](https://arxiv.org/html/2605.30358#bib.bib46)\], sustaining performance requires not only routine calibration but also pulse\-level access to actively tune control waveforms\. However, existing high\-level quantum programming tools such as Qiskit\[[22](https://arxiv.org/html/2605.30358#bib.bib15)\], Cirq\[[17](https://arxiv.org/html/2605.30358#bib.bib16)\], and PennyLane\[[5](https://arxiv.org/html/2605.30358#bib.bib17)\], lack comprehensive support for these fine\-grained hardware controls\.

OpenQASM 3\[[16](https://arxiv.org/html/2605.30358#bib.bib18)\]addresses these limitations by serving as a hardware\-aware intermediate representation that links algorithms to physics\. Unlike high\-level tools, OpenQASM 3 exposes hardware instructions that directly fulfill the diverse operations required for noise mitigation\. First, it supports hardware\-embedded classical logic and control flow, enabling runtime mid\-circuit operations\. Second, it introduces explicit gate timing and scheduling constructs that allow dynamic operation duration, alignment, and delay insertion\. Finally, OpenQASM 3 further extends to pulse\-level control, allowing developers to describe or tune physical control waveforms directly, thus providing users with a means to actively manage calibration details\. Taken together, these capabilities make OpenQASM 3 a key enabler for improving quantum\-computing performance in the NISQ era\.

Given the increasing complexity and features of OpenQASM 3, as well as the rapid progress of Large Language Models \(LLMs\) for code generation\[[24](https://arxiv.org/html/2605.30358#bib.bib30),[44](https://arxiv.org/html/2605.30358#bib.bib29),[36](https://arxiv.org/html/2605.30358#bib.bib31),[25](https://arxiv.org/html/2605.30358#bib.bib28)\], using LLMs to assist OpenQASM programming is a natural next step\. However, this direction is currently limited by the lack of appropriate datasets\. Existing OpenQASM\-related resources fall into two unsatisfactory categories\. Some datasets, such as Veri\-Q\[[14](https://arxiv.org/html/2605.30358#bib.bib8)\]and QASMBench\[[33](https://arxiv.org/html/2605.30358#bib.bib7)\], were designed to benchmark quantum algorithms or hardware platforms rather than to support LLM training or evaluation\. Others, such as QCircuitBench\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\]and Agent\-Q\[[23](https://arxiv.org/html/2605.30358#bib.bib55)\], remain confined to gate\-sequence circuit generation and do not cover the critical features of OpenQASM 3, including classical logic, timing scheduling, and pulse control\. As summarized in Table[1](https://arxiv.org/html/2605.30358#S1.T1), no existing resource simultaneously targets LLM\-based code generation and captures the core features that define OpenQASM 3’s central role in improving NISQ\-era quantum\-computing performance\.

OpenQASM datasetsLLM\-targetedClassical LogicTiming SchedulingPulse ControlOur work \(QASM\-Eval\)✓\\checkmark✓\\checkmark✓\\checkmark✓\\checkmarkQCircuitBench\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\]✓\\checkmark×\\times×\\times×\\timesAgent\-Q\[[23](https://arxiv.org/html/2605.30358#bib.bib55)\]✓\\checkmark×\\times×\\times×\\timesVeri\-Q\[[14](https://arxiv.org/html/2605.30358#bib.bib8)\]×\\times×\\times×\\times×\\timesQASMBench\[[33](https://arxiv.org/html/2605.30358#bib.bib7)\]×\\times×\\times×\\times×\\timesTable 1:Comparison of representative OpenQASM\-related resources by whether they target LLM code generation and whether they cover major OpenQASM 3 feature dimensions\.To address these gaps, we proposeQASM\-Eval, a dataset of OpenQASM 3 coding tasks where critical logic segments are replaced by natural language prompts for LLMs to complete, paired with canonical solutions\. Our main contributions are:

- •We construct an OpenQASM 3 task suite for both LLM training and evaluation\. We release a test set with 100 tasks, a training set with 4000 tasks, and two targeted fine\-tuned models\. The dataset covers key OpenQASM 3 capabilities, including classical logic/control flow, timing constraints, and pulse\-level control\. To validate these features, we extend existing toolchains with new OpenQASM 3 support so that generated code can be automatically verified for syntactic, semantic, and scheduling constraints\.
- •Using the QASM\-Eval test set, we analyze the limitations of current models on OpenQASM 3 tasks and evaluate the effectiveness of our training data for fine\-tuning: on Llama\-8B and Llama\-70B, our fine\-tuning improves pass@1 by 28–58%; the fine\-tuned 8B model approaches the zero\-shot performance of GPT\-5\.2, while the fine\-tuned 70B model exceeds few\-shot\-augmented GPT\.

## 2Related Work

#### Quantum Computing in the NISQ Era

Practical quantum computing remains constrained by the NISQ regime, where device performance is limited by quantum noises originated from various sources, such as state\-preparation and measurement \(SPAM\) errors\[[43](https://arxiv.org/html/2605.30358#bib.bib50)\], imperfect gate implementations\[[41](https://arxiv.org/html/2605.30358#bib.bib51)\], crosstalk\[[2](https://arxiv.org/html/2605.30358#bib.bib52)\]and decoherence\[[30](https://arxiv.org/html/2605.30358#bib.bib43)\], which accumulate as circuits deepen\. A broad line of work therefore focuses on noise mitigation and suppression\. Quantum error correction \(QEC\)\[[19](https://arxiv.org/html/2605.30358#bib.bib35),[42](https://arxiv.org/html/2605.30358#bib.bib38)\]provides a principled route to fault tolerance by encoding logical qubits into multiple physical qubits, checking the status of qubit via mid\-circuit measurement, and locating potential errors via classical computation\[[3](https://arxiv.org/html/2605.30358#bib.bib53),[6](https://arxiv.org/html/2605.30358#bib.bib54)\]\. Dynamical decoupling \(DD\)\[[28](https://arxiv.org/html/2605.30358#bib.bib36),[10](https://arxiv.org/html/2605.30358#bib.bib39)\]reduces noise by inserting carefully timed pulse sequences to average out low\-frequency noise, which makes its highly sensitive to precise control over operation timing and spacing, especially in dynamic circuits\. Calibration is also critical\. Although providers routinely recalibrate devices to maintain fidelity\[[29](https://arxiv.org/html/2605.30358#bib.bib37)\], qubit frequencies, gate parameters, and readout characteristics still keep drifting between calibrations\[[9](https://arxiv.org/html/2605.30358#bib.bib45),[39](https://arxiv.org/html/2605.30358#bib.bib46)\]\. Therefore, advanced users may additionally require direct access to customized control waveforms\[[27](https://arxiv.org/html/2605.30358#bib.bib41),[35](https://arxiv.org/html/2605.30358#bib.bib40)\]\. Yet these hardware\-facing requirements are only weakly exposed in most high\-level quantum software stacks\.

#### OpenQASM 3 Language

OpenQASM 3\[[16](https://arxiv.org/html/2605.30358#bib.bib18)\]was introduced to bridge this gap between high\-level program specification and low\-level hardware execution\. Its most important advance over earlier circuit\-description formats is support for embedded classical computation and control flow, including branching and runtime decisions based on measurement outcomes, which makes mid\-circuit adaptive protocols expressible within the language itself\. OpenQASM 3 also incorporates explicit timing and scheduling constructs, allowing programmers to control operation duration, alignment, and inserted delays with much greater precision; this is essential for expressing temporally sensitive techniques such as DD or hardware\-aware gate orchestration\. In addition, the language extends toward pulse\-level control, enabling the specification and tuning of physical control waveforms needed for calibration\-sensitive experiments and custom hardware manipulation\. These features position OpenQASM 3 as a practical interface for NISQ\-era programs that must interact closely with device physics rather than remain at the level of abstract circuits\.

#### OpenQASM Datasets

The application of LLMs to quantum programming is an emerging field, and current datasets reflect a strong bias toward high\-level SDK ecosystems rather than intermediate representations\. Datasets such as QDataset\[[37](https://arxiv.org/html/2605.30358#bib.bib9)\], Qiskit\-HumanEval\[[45](https://arxiv.org/html/2605.30358#bib.bib10)\], QuanBench\[[20](https://arxiv.org/html/2605.30358#bib.bib11)\], and MQTBench\[[40](https://arxiv.org/html/2605.30358#bib.bib13)\]primarily evaluate host\-side Python code generation\. Among resources targeting OpenQASM directly, Veri\-Q\[[14](https://arxiv.org/html/2605.30358#bib.bib8)\]and QASMBench\[[33](https://arxiv.org/html/2605.30358#bib.bib7)\]focus on compiler optimization and hardware benchmarking using static circuit files\. QCircuitBench\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\]and Agent\-Q\[[23](https://arxiv.org/html/2605.30358#bib.bib55)\]are the closest antecedents for LLM research, as QCircuitBench pairs circuits with natural\-language descriptions and Agent\-Q includes various circuits designed specifically for optimization problems; however, they remain restricted to basic, gate\-sequence scripts\. Consequently, the literature currently lacks any benchmark that captures the dynamic, hardware\-facing features of OpenQASM 3—specifically classical logic, explicit scheduling, and pulse control\.

## 3QASM\-Eval Dataset

In this section, we introduce QASM\-Eval\. To our knowledge, QASM\-Eval is the first dataset designed for LLMs that targets OpenQASM 3 and its advanced hardware\-level features beyond specific quantum circuits\. The test set contains 100 OpenQASM 3 quantum\-coding tasks spanning diverse themes, while the training set can be generated at scale \(we generate 4,000 tasks for fine\-tuning in this work, and our released code enables further scalable generation\.\), together with a corresponding simulation\-based testbed\. We elaborate two key aspects: \(1\) To comprehensively cover the new features introduced by OpenQASM 3, our dataset has three major task categories including classical logic, timing scheduling, and pulse control, as well as one extra category of challenging complex tasks that integrate all three categories based on realistic appplications, such as QEC, DD and calibration \(2\) To support large\-scale data generation while preserving correctness, we adopt a dataset\-construction pipeline that combines curated templates, LLM\-assisted augmentation, and expert review, a strategy that has proven effective in prior LLM\-dataset works\[[46](https://arxiv.org/html/2605.30358#bib.bib47),[34](https://arxiv.org/html/2605.30358#bib.bib48),[26](https://arxiv.org/html/2605.30358#bib.bib49)\]\.

### 3\.1Task Category

QASM\-Eval comprises four task categories as listed in Table[2](https://arxiv.org/html/2605.30358#S3.T2)\. Three categories target core new capabilities in OpenQASM 3: \(1\)*classical\-logic*tasks that exercise classical control and computation, \(2\)*timing scheduling*tasks that focus on timing and scheduling primitives, and \(3\)*pulse control*tasks that involve low\-level pulses, calibrations, and related functionality\. In addition, we include \(4\)*complex*category that composes all features into more challenging, real\-world problems such as QEC, DD and calibration\. Further details can be found in Appendix[D](https://arxiv.org/html/2605.30358#A4)\.

Table 2:Task taxonomy of QASM\-Eval with four categories\. Each category contains 25 tasks for testing and 1000 tasks for trainingcategory\# test\# traininvolved features/functionalitiesclassical251000if/else, mid\-circuit measurement, while loop, for loop, switch statement, arithmetic calculation, dynamic unit, dynamic data type, type casting, array, dynamic comparison, bit\-wise operation, external functionstiming251000delay, duration, hybrid units, stretch, multiple stretch, alignment, proportional arrangement, dynamic duration, box operations, barrierpulse251000wave calibrate/rewrite/measure, shift phase, modulation, custom waveforms, frame sync, param gate, multiplex readout, phase trackingcomplex251000all of above \+ real\-world application scenarios including QEC, DD, calibration, RAMSEY, Hahn echo, parity check, crosstalk detection, …#### Classical Logic Tasks

This category places quantum circuits within an executable classical\-control framework\. Tasks typically require using conditional branches to drive quantum operations, or computing over and iterating through classical objects to decide subsequent circuit behavior\. For instance, a task may compare mid\-circuit measurement outcomes against classical data to enter anif/elsebranch, or usewhileloops to repeatedly probe or steer qubit states, withbreakandcontinueto constrain real\-time control flow\. We also incorporate standard classical instructions such as Boolean comparisons, bitwise operations \(including shifts andpopcount\), and type casting to ensure that each problem exercises a nontrivial coupling of classical logic and quantum execution\.

#### Timing Scheduling Tasks

This category focuses on program\-level scheduling of quantum operations\. Basic tasks require explicitly insertingdelayinstructions and using thedurationtype to compose timing constants and arithmetic at compile time\. Tasks may also usedt, a backend\-defined unit tied to the sampling period, to avoid mismatches between waveform sampling rates and time resolution\. More advanced tasks involve*stretchable*time intervals whose values are solved by the compiler without relying on specific calibration lengths, enabling constraints such as alignment, uniform spacing, or “as\-late\-as\-possible” placement\. For example, combiningstretchwithbarriercan enforce that selected operations complete simultaneously\.

#### Pulse Control Tasks

This category targets low\-level, executable pulse programs\. Tasks require manipulating waveforms under explicit resource constraints, selecting the correct control targets, initializing sequence objects, executing pulses, and performing readout\. For example, they exercise*ports*as hardware\-facing I/O abstractions and*frames*as stateful containers that track clocking, carrier phase, and frequency\. Programs update these states via instructions such asset\_phase,shift\_phase, andset\_frequency, and then schedule a waveform onto a frame usingplay\. We further include composition operators such asmix,sum,phase\_shift, andscaleto cover modulation, phase tracking, custom waveform synthesis, and multiplexed readout\.

#### Complex Tasks

This category composes at least two of the three categories above \(classical logic, timing, and pulse control\) to approximate realistic experimental workflows, including typical scenarios such as QEC, DD, and calibration\. Representative tasks includesyndrome\_feedforward\_idle\_schedulingfor QEC, which integrates pulse\-level syndrome capture with conditional feedforward logic and precise idle padding \(covers classical logic and pulse control\)\. Other examples areboxed\_dynamic\_decouplingfor DD, which embeds a decoupling pulse sequence within a fixed time window using elastic gaps \(covers timing and pulse control\), andmeasurement\_crosstalk\_calibrationfor calibration, which evaluates hardware crosstalk by interleaving simultaneous driving, readout, and conditional bitwise operations \(covers all three categories\)\. These tasks involve a broader set of interacting features and are therefore well\-suited for evaluating LLM performance in practice\-oriented quantum programming settings\.

### 3\.2Dataset Construction

![Refer to caption](https://arxiv.org/html/2605.30358v1/x1.png)Figure 1:Overview of the QASM\-Eval construction pipeline\.#### Template Preparation

To ensure both the diversity and correctness of QASM\-Eval, we first ask quantum\-programming experts to design two types of templates\(Figure[1](https://arxiv.org/html/2605.30358#S3.F1)\-a\)\. The first type, background templates, randomizes the circuit context \(e\.g\., qubit configurations, default frequencies, etc\.\) and applies random gate sequences to diversify the initial quantum state\. The second type, core\-task templates, targets specific themes likeif/elsecontrol flows or pulse operations, and leaves placeholders for operands and parameters to be filled in later through randomized selection\.

#### Instance Generation

Instance generation process is shown in Figure[1](https://arxiv.org/html/2605.30358#S3.F1)\-b\. Based on templates, we generate randomized background circuits \(BGs\) with diverse context such as qubit configurations\. Then, to further increase diversity of core tasks, we use coder\-LLM \(Qwen3\-Coder\-480B\) augmented with OpenQASM 3 documentation to create stylistic and structural variants for each core task template \(see examples in Appendix[E](https://arxiv.org/html/2605.30358#A5)\)\. All template variants were reviewed by experts to ensure correctness\. Conditioned on BGs, the core generator selects appropriate objects, operands and parameters, instantiating a theme\-specific core\-task template \(including variants\) to produce a complete OpenQASM 3 program\. We also append an explicit observation step \(qubit measurements\) to ensure task outcomes are well\-defined and verifiable\.

##### Natural Language Prompt Generation

Finally, we assemble the dataset into specification–solution pairs \(Figure[1](https://arxiv.org/html/2605.30358#S3.F1)\-c\)\. Specifically, we merge a sampled background with one core task to form a complete OpenQASM program\. We then use LLM to generate a natural\-language prompt describing the core task, embedding it into the program as a TODO comment\. This yields the final paired data: \(i\) the specification \(background \+ TODO comment\) and \(ii\) the reference solution \(background \+ core\-task code\)\. To prevent data leakage, we assign one variant per theme to the test set and allocate all remaining variants to the training set\. Additionally, experts independently verify all 100 test problems, and we spot\-check 400 training instances to ensure the specifications are unambiguous, correct, and free of answer leakage\.

### 3\.3Evaluation Method

We evaluate model\-generated solutions with a automated verifier that checks both*syntax*and*semantics*\. Semantic verification is further decomposed into two complementary views: the*final quantum state*induced by the program and the*execution schedule*implied by timing primitives\. To support these checks under OpenQASM 3 features that are not fully covered by existing toolchains\[[21](https://arxiv.org/html/2605.30358#bib.bib34)\], we implement a quantum\-simulation\-based automatic verifier as shown in Figure[1](https://arxiv.org/html/2605.30358#S3.F1)\-d\.

#### Syntax checking\.

We parse each submitted program using the official OpenQASM parser and obtain an abstract syntax tree \(AST\)\. The verifier inspects statement types and their constituents to ensure that the program is well\-formed and adheres to the OpenQASM 3 syntax\. In addition, we explicitly check whether the construct elements \(e\.g\. a While loop structure or Switch structure\)required by the taskTODOprompt are present in the submitted solution\. Failed samples are categorized assyntax errorsorelement errors\.

#### Quantum\-state checking\.

Because common frameworks such as Qiskit\[[22](https://arxiv.org/html/2605.30358#bib.bib15)\]and Cirq\[[17](https://arxiv.org/html/2605.30358#bib.bib16)\]only support a subset of OpenQASM 3 functionality, we first extend our frontend with new OpenQASM 3 semantics, and then use statevector simulation \(built on Qiskit\) and pulse\-level simulation \(built on Qutip\[[31](https://arxiv.org/html/2605.30358#bib.bib56)\]\) to compute the final quantum state\. We compare the simulated quantum state distribution against the reference behavior defined by the task\. Failed samples are categorized asdistribution errors\.

#### Schedule checking\.

To validate timing behavior, we assign default durations to supported OpenQASM 3 operations and expose an interface to override them when needed\. After interpreting OpenQASM 3 timing constructs, the verifier computes a software schedule and produces a virtual timeline, which is then checked against the task’s scheduling constraints\. Failed samples are categorized astimeline errors\.

#### Pass criterion and metrics\.

Syntax checking is mandatory for all tasks\. Quantum\-state checking is applied to*classical\-logic*and*pulse\-control*tasks, while schedule\-based checking is applied to*timing scheduling*tasks\. For*complex*tasks, both state and schedule checks must pass\. We summarize the overall pass condition as:

Pass​\(x\)=Syn​\(x\)∧\(¬𝕀state​\(x\)∨State​\(x\)\)∧\(¬𝕀sched​\(x\)∨Sched​\(x\)\),\\mathrm\{Pass\}\(x\)=\\mathrm\{Syn\}\(x\)\\wedge\\Big\(\\neg\\mathbb\{I\}\_\{\\mathrm\{state\}\}\(x\)\\,\\vee\\,\\mathrm\{State\}\(x\)\\Big\)\\wedge\\Big\(\\neg\\mathbb\{I\}\_\{\\mathrm\{sched\}\}\(x\)\\,\\vee\\,\\mathrm\{Sched\}\(x\)\\Big\),whereSyn​\(x\)\\mathrm\{Syn\}\(x\)denotes successful parsing/structural validation,State​\(x\)\\mathrm\{State\}\(x\)denotes passing the quantum\-state check, andSched​\(x\)\\mathrm\{Sched\}\(x\)denotes passing the schedule check\. The indicator functions𝕀state​\(x\)\\mathbb\{I\}\_\{\\mathrm\{state\}\}\(x\)and𝕀sched​\(x\)\\mathbb\{I\}\_\{\\mathrm\{sched\}\}\(x\)specify whether a task instance requires state and schedule verification, respectively \(e\.g\., both are true for complex tasks\)\.

We report performance using the standard pass@kkmetric\. Givennngenerated samples for a task andcccorrect ones among them, we use the unbiased estimator:

pass​@​k=1−\(n−ck\)\(nk\)\\mathrm\{pass@\}k=1\-\\dfrac\{\\binom\{n\-c\}\{k\}\}\{\\binom\{n\}\{k\}\}

#### Evaluation Quality

We assess the fidelity of our simulation and validation pipeline by comparing its judgments against expert annotations\. Specifically, we collect ChatGPT\-5\.2\-Thinking outputs on the 100\-task test set and label each solution as correct/incorrect using \(i\) our automated verifier \(*Verifier*\) and \(ii\) independent quantum\-programming experts \(*Expert*\)\. As shown in Table[3](https://arxiv.org/html/2605.30358#S3.T3), the two agree on 92 out of 100 tasks\. For the 8 disagreements, the primary source is ambiguity in interpreting prompt requirements in theTODOdescription \(Detailed examples in Appendix[F](https://arxiv.org/html/2605.30358#A6)\)\. Overall, the agreement rate is 0\.92 and Cohen’sκ\\kappais 0\.837, indicating strong reliability\.

Table 3:Agreement between the automated verifier \(*Verifier*\) and human experts \(*Expert*\) on 100 test\-set solutions produced by ChatGPT\-5\.2\-Thinking\.solutions from GPT\-5\.2expert\-passexpert\-failalignment ratekappaverifier\-pass5250\.920\.837verifier\-fail340

## 4Experiments

### 4\.1Experimental Setup

#### Models and Methods

\(1\) We first evaluate QASM\-Eval on a range of models, including a large reasoning\-oriented LLM \(ChatGPT\-5\.2\-Thinking; denotedgpt\-5\.2\), DeepSeek\-V3\-0324 \(dpsk\-v3\), a medium\-scale open\-source model Llama\-3\.3\-70B\-Instruct \(llama70b\-base\), and a small open\-source model Meta\-Llama\-3\.1\-8B\-Instruct \(llama8b\-base\)\. Other tested models can be found in Appendix[H](https://arxiv.org/html/2605.30358#A8)\. \(2\) In addition to the base models, we evaluate a few\-shot prompting setting\[[8](https://arxiv.org/html/2605.30358#bib.bib33)\], has been shown to improve code\-generation and code\-editing performance\. For each test problem, we provide three example prompt–solution pairs drawn from the same theme but different code variants\. We denote them asgpt\-5\.2\-fs,dpsk\-v3\-fs,llama70b\-fs, andllama8b\-fs\. \(3\) To assess the utility of the QASM\-Eval training split, we further perform 3 LoRA fine\-tuning variants for both Llama\-70B and Llama\-8B: \(i\) fine\-tuning on the prior QCircuitBench dataset\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\]\(llama70b\-QCB,llama8b\-QCB\); \(ii\) fine\-tuning on the prior Agent\-Q dataset\[[23](https://arxiv.org/html/2605.30358#bib.bib55)\]\(llama70b\-AGQ,llama8b\-AGQ\) and \(iii\) fine\-tuning on the QASM\-Eval training set \(llama70b\-ours,llama8b\-ours\)\. For each model and each task, we sample 5 independent solutions and compute pass@kkaccordingly\.

#### Training Details

As discussed earlier, the fine\-tuning training set is strictly disjoint from the test set: it is generated from different variants under similar themes \(examples in Appendix[E](https://arxiv.org/html/2605.30358#A5)\), which prevents leakage of test content into training\. We generate 4,000 training tasks, totaling approximately 12M tokens\. We run LoRA training on two H100 GPUs \(80GB\), with context length 8192, batch size 8, learning rate1×10−51\\times 10^\{\-5\}, LoRA rank 8, and LoRAα=8\\alpha=8\. We train for 3 epochs\.

### 4\.2Experimental Results

Table[4](https://arxiv.org/html/2605.30358#S4.T4)reports the main pass@1 results\.Base models exhibit limited performance even at the high end: ChatGPT\-5\.2\-Thinking achieves 0\.54 overall, while open\-source baselines are substantially lower\. Performance is comparatively strong on*pulse\-control*tasks \(0\.68–0\.92 across base models\), because many instances resemble structured API\-style waveform construction and function\-like composition\. In contrast, models largely fail on*complex*tasks \(0\.00–0\.08\), where success requires simultaneously satisfying OpenQASM 3 syntax constraints and interpreting long, constraint\-heavy natural\-language specifications\. These results indicate thatthe combination of OpenQASM syntax and dense task descriptions remains the primary bottleneckfor current LLMs\.

Providingfew\-shot exemplars\(3 examples per problem\) substantially improves performance\. The gains are particularly pronounced for larger models on complex tasks: ChatGPT\-5\.2\-Thinking improves from 0\.08 to 0\.64 on the complex category, and its overall pass@1 increases from 0\.54 to 0\.78\. We attribute this to exemplars exposing relevant syntactic patterns and, crucially, illustrating the mapping between the task description and the corresponding OpenQASM implementation\.

Finally,fine\-tuning yields the largest gains\. With LoRA fine\-tuning on the QASM\-Eval training set, a mid\-scale model \(llama70b\) reaches 0\.85 overall and 0\.64 on complex tasks, outperforming even few\-shot augmented ChatGPT\-5\.2\-Thinking\. The small model \(llama8b\) also improves markedly \(from 0\.24 to 0\.52 overall\), approaching the zero\-shot performance of ChatGPT\-5\.2\-Thinking\. In contrast, fine\-tuning on QCircuitBench and Agent\-Q provides only marginal improvements on QASM\-Eval\. Overall, these results suggest that QASM\-Eval fine\-tuning effectively teaches OpenQASM 3\-specific syntax and improves models’ ability to follow QASM\-style, constraint\-driven task specifications\.

Table 4:Main results \(pass@1\) on QASM\-Eval across task categories\. “\-fs” denotes few\-shot prompting with exemplars\. “\-QCB”,“\-AGQ” and “\-ours” denote LoRA fine\-tuning on QCircuitBench\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\], Agent\-Q\[[23](https://arxiv.org/html/2605.30358#bib.bib55)\]and on the QASM\-Eval training set, respectively\.Modelpass@1ClassicalTimingPulseComplexOverallgpt\-5\.20\.480\.680\.920\.080\.54dpsk\-v30\.320\.360\.800\.000\.37llama70b\-base0\.120\.280\.680\.000\.27llama8b\-base0\.080\.200\.680\.000\.24gpt\-5\.2\-fs0\.600\.920\.960\.640\.78dpsk\-v3\-fs0\.480\.800\.880\.440\.65llama70b\-fs0\.600\.720\.840\.240\.60llama8b\-fs0\.080\.320\.880\.000\.32llama70b\-QCB0\.200\.280\.640\.000\.28llama8b\-QCB0\.040\.160\.280\.000\.12llama70b\-AGQ0\.160\.280\.540\.000\.25llama8b\-AGQ0\.040\.160\.240\.000\.11llama70b\-ours0\.800\.961\.000\.640\.85llama8b\-ours0\.520\.640\.840\.080\.52
### 4\.3Performance Analysis

![Refer to caption](https://arxiv.org/html/2605.30358v1/x2.png)Figure 2:Adaptation within the Llama family under few\-shot prompting and fine\-tuning, measured by overall pass@1 on QASM\-Eval\.
![Refer to caption](https://arxiv.org/html/2605.30358v1/x3.png)Figure 3:Error breakdown by type for Llama baselines, few\-shot, and fine\-tuning \(500 samples in total, each may have multiple errors\)

#### Few\-shot and fine\-tuning for the Llama family

Figure[3](https://arxiv.org/html/2605.30358#S4.F3)isolates the impact of prompting and fine\-tuning within the Llama family\. Few\-shot prompting yields a modest gain for Llama\-8B \(0\.24→\\rightarrow0\.32\) but a substantially larger gain for Llama\-70B \(0\.27→\\rightarrow0\.60\), suggesting a strong interaction between model scale and in\-context exemplars\. LoRA fine\-tuning on QASM\-Eval improved Llama\-70B\-ours to 0\.85 overall and Llama\-8B\-ours to 0\.52, Overall, the results follow a consistent trend \(*base*<<*few\-shot*<<*fine\-tuned*\)\. This monotonic progression from base to few\-shot to fine\-tuned motivates a stage\-wise diagnosis of which constraints are learned at each step\. We therefore analyze the failures using our error taxonomy\.

#### Error taxonomy and frequency analysis

Figure[3](https://arxiv.org/html/2605.30358#S4.F3)breaks down errors \(as mentioned in[3\.3](https://arxiv.org/html/2605.30358#S3.SS3)\) by type across the Llama baselines, few\-shot variants, and QASM\-Eval fine\-tuned models\. Across both scales, the overall error count decreases monotonically from*base*to*few\-shot*to*fine\-tuned*\. The most pronounced reduction is in*syntax errors*\(Llama\-70B: 311→\\rightarrow91→\\rightarrow8; Llama\-8B: 244→\\rightarrow179→\\rightarrow75\), suggesting that fine\-tuning primarily improves code validity and formatting compliance—the most directly learnable signal under supervised updates\.

Timing\-related failures exhibit a clear scale\-dependent pattern\. For Llama\-8B,*timeline errors*show little net improvement \(163→\\rightarrow172→\\rightarrow178\), indicating limited gains on timing\-specific semantic constraints\. In contrast, Llama\-70B shows a substantial reduction after fine\-tuning \(97→\\rightarrow121→\\rightarrow65\), consistent with improved handling of timing requirements beyond syntactic correctness\. As the syntax bottleneck is alleviated, the remaining failures increasingly concentrate on semantic constraint violations \(e\.g\., distribution/behavior mismatches\), highlighting semantics as the next limiting factor\.

#### Effect of fine\-tuning data on syntax burden and pass@1

Figure[4](https://arxiv.org/html/2605.30358#S4.F4)further analyzes the mechanism behind fine\-tuning gains by relating, across task categories, changes in*syntax success rate*to changes in pass@1\. We mainly compare to QCircuitBench \(*QCB*\) here because Agent\-Q \(*AGQ*\) is narrower in scope and does not outperform QCB on QASM\-Eval\. Consistent with the error breakdown in Figure[3](https://arxiv.org/html/2605.30358#S4.F3), fine\-tuning on QASM\-Eval training\-set primarily improves syntactic executability, and this improvement strongly correlates with end\-task success\.

Overall, changes in pass@1 closely track changes in syntax success: when the syntax success rate increases, pass@1 tends to rise in tandem; when syntax success drops, pass@1 also deteriorates\. This indicates that on the harder subsets, syntactic validity remains a dominant bottleneck\.

Fine\-tuning on*QCB*does not consistently move models into a higher\-syntax\-success regime: most categories show limited gains or even regressions in syntax success, which is mirrored by weak or negative changes in pass@1\. In contrast, fine\-tuning on QASM\-Eval yields broad and substantial improvements in syntax success across categories, and these improvements translate into higher pass@1\. Notably, the gains are driven primarily by reducing unparsable or non\-executable outputs, rather than by small semantic improvements within already\-executable solutions\.

![Refer to caption](https://arxiv.org/html/2605.30358v1/x4.png)Figure 4:Relationship between changes in syntax success rate and changes in pass@1 across task categories under different fine\-tuning data\.
#### Effect of increasing the number of sampled solutions

Figure[5](https://arxiv.org/html/2605.30358#S4.F5)summarizes pass@kkas a function ofkkfor Llama\-8B and 70B\. For the base models, increasing the sampling budget provides only negligible marginal gains \(Llama\-8B: 0\.24→\\rightarrow0\.26; Llama\-70B: 0\.27→\\rightarrow0\.28\), indicating that the probability of producing a correct solution in the candidates remains extremely low; additional sampling rarely “rescues” an incorrect attempt\. Few\-shot prompting increases the chance that at least one of thekkcandidates is correct, leading to noticeably higher pass@kkcurves\. Fine\-tuning yields the highest pass@1 and continues to benefit from additional samples \(Llama\-8B\-ours: 0\.52→\\rightarrow0\.66; Llama\-70B\-ours: 0\.85→\\rightarrow0\.89\), although the 70B curve saturates quickly\. Overall, these trends suggest that fine\-tuning increases the presence of correct solutions; conversely, relying on heavy sampling is a poor strategy when the model has not yet mastered OpenQASM 3 syntax and features\.

![Refer to caption](https://arxiv.org/html/2605.30358v1/x5.png)

![Refer to caption](https://arxiv.org/html/2605.30358v1/x6.png)

Figure 5:pass@kkas a function of the sampling budgetkkfor llama8b and llama70b under base, few\-shot, and fine\-tuned settings\.

## 5Conclusion

To address the critical gap in evaluating and training LLMs for advanced quantum programming with OpenQASM 3, we introduced QASM\-Eval, the first comprehensive dataset targeting the advanced, hardware\-facing features of OpenQASM 3\. By moving beyond high\-level circuit abstractions to incorporate classical logic, explicit timing, and pulse\-level control, QASM\-Eval provides a rigorous and realistic testbed\. Our automated simulation and scheduling verifier reveals that while current state\-of\-the\-art models struggle with OpenQASM 3’s strict syntactic and semantic constraints, targeted fine\-tuning on our dataset unlocks substantial capabilities\. Notably, fine\-tuning enables mid\-scale models to surpass the performance of few\-shot\-assisted GPT\-5\.2\. By open\-sourcing our dataset, automated generation pipeline, fine\-tuned models and simulation\-verification framework, we aim to accelerate the development of reliable LLM assistants capable of bridging the gap between algorithmic intent and physical quantum execution\.

## References

- \[1\]A\. Abbas, A\. Ambainis, B\. Augustino, A\. Bärtschi, H\. Buhrman, C\. Coffrin, G\. Cortiana, V\. Dunjko, D\. J\. Egger, B\. G\. Elmegreen,et al\.\(2024\)Challenges and opportunities in quantum optimization\.Nature Reviews Physics6\(12\),pp\. 718–735\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[2\]M\. Ahsan, S\. A\. Z\. Naqvi, and H\. Anwer\(2022\)Quantum circuit engineering for correcting coherent noise\.Physical Review A105\(2\),pp\. 022428\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[3\]A\. Ashikhmin, C\. Lai, and T\. A\. Brun\(2014\)Robust quantum error syndrome extraction by classical coding\.In2014 IEEE International Symposium on Information Theory,pp\. 546–550\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\]A\. Basit, N\. Innan, M\. H\. Asif, M\. Shao, M\. Kashif, A\. Marchisio, and M\. Shafique\(2025\)Pennylang: pioneering llm\-based quantum code generation with a novel pennylane\-centric dataset\.arXiv preprint arXiv:2503\.02497\.Cited by:[Appendix C](https://arxiv.org/html/2605.30358#A3.p1.1)\.
- \[5\]V\. Bergholm, J\. Izaac, M\. Schuld, C\. Gogolin, S\. Ahmed, V\. Ajith, M\. S\. Alam, G\. Alonso\-Linaje, B\. AkashNarayanan, A\. Asadi, J\. M\. Arrazola, U\. Azad, S\. Banning, C\. Blank, T\. R\. Bromley, B\. A\. Cordier, J\. Ceroni, A\. Delgado, O\. D\. Matteo, A\. Dusko, T\. Garg, D\. Guala, A\. Hayes, R\. Hill, A\. Ijaz, T\. Isacsson, D\. Ittah, S\. Jahangiri, P\. Jain, E\. Jiang, A\. Khandelwal, K\. Kottmann, R\. A\. Lang, C\. Lee, T\. Loke, A\. Lowe, K\. McKiernan, J\. J\. Meyer, J\. A\. Montañez\-Barrera, R\. Moyard, Z\. Niu, L\. J\. O’Riordan, S\. Oud, A\. Panigrahi, C\. Park, D\. Polatajko, N\. Quesada, C\. Roberts, N\. Sá, I\. Schoch, B\. Shi, S\. Shu, S\. Sim, A\. Singh, I\. Strandberg, J\. Soni, A\. Száva, S\. Thabet, R\. A\. Vargas\-Hernández, T\. Vincent, N\. Vitucci, M\. Weber, D\. Wierichs, R\. Wiersema, M\. Willmann, V\. Wong, S\. Zhang, and N\. Killoran\(2022\)PennyLane: automatic differentiation of hybrid quantum\-classical computations\.External Links:1811\.04968,[Link](https://arxiv.org/abs/1811.04968)Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[6\]D\. Bhatnagar, M\. Steinberg, D\. Elkouss, C\. G\. Almudever, and S\. Feld\(2023\)Low\-depth flag\-style syndrome extraction for small quantum error\-correction codes\.In2023 IEEE International Conference on Quantum Computing and Engineering \(QCE\),Vol\.1,pp\. 63–69\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[7\]J\. Biamonte, P\. Wittek, N\. Pancotti, P\. Rebentrost, N\. Wiebe, and S\. Lloyd\(2017\)Quantum machine learning\.Nature549\(7671\),pp\. 195–202\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[8\]T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei\(2020\)Language models are few\-shot learners\.External Links:2005\.14165,[Link](https://arxiv.org/abs/2005.14165)Cited by:[§4\.1](https://arxiv.org/html/2605.30358#S4.SS1.SSS0.Px1.p1.1)\.
- \[9\]J\. J\. Burnett, A\. Bengtsson, M\. Scigliuzzo, D\. Niepce, M\. Kudra, P\. Delsing, and J\. Bylander\(2019\)Decoherence benchmarking of superconducting qubits\.npj Quantum Information5\(1\),pp\. 54\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[10\]J\. Bylander, S\. Gustavsson, F\. Yan, F\. Yoshihara, K\. Harrabi, G\. Fitch, D\. G\. Cory, Y\. Nakamura, J\. Tsai, and W\. D\. Oliver\(2011\)Noise spectroscopy through dynamical decoupling with a superconducting flux qubit\.Nature Physics7\(7\),pp\. 565–570\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[11\]C\. Campbell, H\. M\. Chen, W\. Luk, and H\. Fan\(2025\)Enhancing llm\-based quantum code generation with multi\-agent optimization and quantum error correction\.In2025 62nd ACM/IEEE Design Automation Conference \(DAC\),pp\. 1–7\.Cited by:[Appendix C](https://arxiv.org/html/2605.30358#A3.p1.1)\.
- \[12\]Y\. Cao, J\. Romero, J\. P\. Olson, M\. Degroote, P\. D\. Johnson, M\. Kieferová, I\. D\. Kivlichan, T\. Menke, B\. Peropadre, N\. P\. Sawaya,et al\.\(2019\)Quantum chemistry in the age of quantum computing\.Chemical reviews119\(19\),pp\. 10856–10915\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[13\]M\. Cerezo, G\. Verdon, H\. Huang, L\. Cincio, and P\. J\. Coles\(2022\)Challenges and opportunities in quantum machine learning\.Nature computational science2\(9\),pp\. 567–576\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[14\]K\. Chen, W\. Fang, J\. Guan, X\. Hong, M\. Huang, J\. Liu, Q\. Wang, and M\. Ying\(2022\)VeriQBench: a benchmark for multiple types of quantum circuits\.External Links:2206\.10880,[Link](https://arxiv.org/abs/2206.10880)Cited by:[Table 1](https://arxiv.org/html/2605.30358#S1.T1.16.16.5),[§1](https://arxiv.org/html/2605.30358#S1.p3.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[15\]S\. Chen, J\. Cotler, H\. Huang, and J\. Li\(2023\)The complexity of nisq\.Nature Communications14\(1\),pp\. 6001\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[16\]A\. Cross, A\. Javadi\-Abhari, T\. Alexander, N\. De Beaudrap, L\. S\. Bishop, S\. Heidel, C\. A\. Ryan, P\. Sivarajah, J\. Smolin, J\. M\. Gambetta,et al\.\(2022\)OpenQASM 3: a broader and deeper quantum assembly language\.ACM Transactions on Quantum Computing3\(3\),pp\. 1–50\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p2.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px2.p1.1)\.
- \[17\]C\. Developers\(2025\-08\)Cirq\.Zenodo\.External Links:[Link](https://zenodo.org/doi/10.5281/zenodo.4062499),[Document](https://dx.doi.org/10.5281/ZENODO.4062499)Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§3\.3](https://arxiv.org/html/2605.30358#S3.SS3.SSS0.Px2.p1.1)\.
- \[18\]N\. Dupuis, L\. Buratti, S\. Vishwakarma, A\. V\. Forrat, D\. Kremer, I\. Faro, R\. Puri, and J\. Cruz\-Benito\(2024\)Qiskit code assistant: training llms for generating quantum computing code\.In2024 IEEE LLM Aided Design Workshop \(LAD\),pp\. 1–4\.Cited by:[Appendix C](https://arxiv.org/html/2605.30358#A3.p1.1)\.
- \[19\]D\. Gottesman\(2010\)An introduction to quantum error correction and fault\-tolerant quantum computation\.InQuantum information science and its contributions to mathematics, Proceedings of Symposia in Applied Mathematics,Vol\.68,pp\. 13–58\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]X\. Guo, M\. Wang, and J\. Zhao\(2025\)QuanBench: benchmarking quantum code generation with large language models\.arXiv preprint arXiv:2510\.16779\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[21\]IBM Quantum\(2026\)OpenQASM 3 feature table\(Website\)IBM\.Note:IBM Quantum DocumentationExternal Links:[Link](https://quantum.cloud.ibm.com/docs/en/guides/qasm-feature-table)Cited by:[§3\.3](https://arxiv.org/html/2605.30358#S3.SS3.p1.1)\.
- \[22\]A\. Javadi\-Abhari, M\. Treinish, K\. Krsulich, C\. J\. Wood, J\. Lishman, J\. Gacon, S\. Martiel, P\. D\. Nation, L\. S\. Bishop, A\. W\. Cross, B\. R\. Johnson, and J\. M\. Gambetta\(2024\)Quantum computing with qiskit\.External Links:2405\.08810,[Link](https://arxiv.org/abs/2405.08810)Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§3\.3](https://arxiv.org/html/2605.30358#S3.SS3.SSS0.Px2.p1.1)\.
- \[23\]L\. Jern, V\. Uotila, C\. Yu, and B\. Zhao\(2025\)Agent\-q: fine\-tuning large language models for quantum circuit generation and optimization\.In2025 IEEE International Conference on Quantum Computing and Engineering \(QCE\),Vol\.1,pp\. 1621–1632\.Cited by:[Table 1](https://arxiv.org/html/2605.30358#S1.T1.12.12.5),[§1](https://arxiv.org/html/2605.30358#S1.p3.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.30358#S4.SS1.SSS0.Px1.p1.1),[Table 4](https://arxiv.org/html/2605.30358#S4.T4)\.
- \[24\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan\(2023\)Swe\-bench: can language models resolve real\-world github issues?\.arXiv preprint arXiv:2310\.06770\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p3.1)\.
- \[25\]S\. Joel, J\. Wu, and F\. Fard\(2024\)A survey on llm\-based code generation for low\-resource and domain\-specific programming languages\.ACM Transactions on Software Engineering and Methodology\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p3.1)\.
- \[26\]S\. A\. Joseph, S\. M\. Husain, S\. S\. Offner, S\. Juneau, P\. Torrey, A\. S\. Bolton, J\. P\. Farias, N\. Gaffney, G\. Durrett, and J\. J\. Li\(2025\)Astrovisbench: a code benchmark for scientific computing and visualization in astronomy\.arXiv preprint arXiv:2505\.20538\.Cited by:[§3](https://arxiv.org/html/2605.30358#S3.p1.1)\.
- \[27\]N\. Khaneja, T\. Reiss, C\. Kehlet, T\. Schulte\-Herbrüggen, and S\. J\. Glaser\(2005\)Optimal control of coupled spin dynamics: design of nmr pulse sequences by gradient ascent algorithms\.Journal of magnetic resonance172\(2\),pp\. 296–305\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[28\]K\. Khodjasteh and D\. A\. Lidar\(2005\)Fault\-tolerant quantum dynamical decoupling\.Physical review letters95\(18\),pp\. 180501\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[29\]E\. Knill, D\. Leibfried, R\. Reichle, J\. Britton, R\. B\. Blakestad, J\. D\. Jost, C\. Langer, R\. Ozeri, S\. Seidelin, and D\. J\. Wineland\(2008\)Randomized benchmarking of quantum gates\.Physical Review A—Atomic, Molecular, and Optical Physics77\(1\),pp\. 012307\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[30\]P\. Krantz, M\. Kjaergaard, F\. Yan, T\. P\. Orlando, S\. Gustavsson, and W\. D\. Oliver\(2019\)A quantum engineer’s guide to superconducting qubits\.Applied physics reviews6\(2\)\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]N\. Lambert, E\. Giguère, P\. Menczel, B\. Li, P\. Hopf, G\. Suárez, M\. Gali, J\. Lishman, R\. Gadhvi, R\. Agarwal,et al\.\(2026\)QuTiP 5: the quantum toolbox in python\.Physics Reports1153,pp\. 1–62\.Cited by:[§3\.3](https://arxiv.org/html/2605.30358#S3.SS3.SSS0.Px2.p1.1)\.
- \[32\]S\. Lee, J\. Lee, H\. Zhai, Y\. Tong, A\. M\. Dalzell, A\. Kumar, P\. Helms, J\. Gray, Z\. Cui, W\. Liu,et al\.\(2023\)Evaluating the evidence for exponential quantum advantage in ground\-state quantum chemistry\.Nature communications14\(1\),pp\. 1952\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[33\]A\. Li, S\. Stein, S\. Krishnamoorthy, and J\. Ang\(2023\-02\)QASMBench: a low\-level quantum benchmark suite for nisq evaluation and simulation\.ACM Transactions on Quantum Computing4\(2\)\.External Links:[Link](https://doi.org/10.1145/3550488),[Document](https://dx.doi.org/10.1145/3550488)Cited by:[Table 1](https://arxiv.org/html/2605.30358#S1.T1.20.20.5),[§1](https://arxiv.org/html/2605.30358#S1.p3.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[34\]Z\. Lu, Y\. Yang, H\. Ren, H\. Hou, H\. Xiao, K\. Wang, W\. Shi, A\. Zhou, M\. Zhan, and H\. Li\(2025\)Webgen\-bench: evaluating llms on generating interactive and functional websites from scratch\.arXiv preprint arXiv:2505\.03733\.Cited by:[§3](https://arxiv.org/html/2605.30358#S3.p1.1)\.
- \[35\]F\. Motzoi, J\. M\. Gambetta, P\. Rebentrost, and F\. K\. Wilhelm\(2009\)Simple pulses for elimination of leakage in weakly nonlinear qubits\.Physical review letters103\(11\),pp\. 110501\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[36\]D\. Nam, A\. Macvean, V\. Hellendoorn, B\. Vasilescu, and B\. Myers\(2024\)Using an llm to help with code understanding\.InProceedings of the IEEE/ACM 46th International Conference on Software Engineering,pp\. 1–13\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p3.1)\.
- \[37\]E\. Perrier, A\. Youssry, and C\. Ferrie\(2022\)QDataSet, quantum datasets for machine learning\.Scientific data9\(1\),pp\. 582\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[38\]J\. Preskill\(2018\)Quantum computing in the nisq era and beyond\.Quantum2,pp\. 79\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1)\.
- \[39\]T\. Proctor, M\. Revelle, E\. Nielsen, K\. Rudinger, D\. Lobser, P\. Maunz, R\. Blume\-Kohout, and K\. Young\(2020\)Detecting and tracking drift in quantum information processors\.Nature communications11\(1\),pp\. 5396\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[40\]N\. Quetschlich, L\. Burgholzer, and R\. Wille\(2023\)MQT bench: benchmarking software and design automation tools for quantum computing\.Quantum7,pp\. 1062\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[41\]K\. Rudinger, T\. Proctor, D\. Langharst, M\. Sarovar, K\. Young, and R\. Blume\-Kohout\(2019\)Probing context\-dependent errors in quantum processors\.Physical Review X9\(2\),pp\. 021045\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[42\]C\. Ryan\-Anderson, J\. G\. Bohnet, K\. Lee, D\. Gresh, A\. Hankin, J\. P\. Gaebler, D\. Francois, A\. Chernoguzov, D\. Lucchetti, N\. C\. Brown,et al\.\(2021\)Realization of real\-time fault\-tolerant quantum error correction\.Physical Review X11\(4\),pp\. 041058\.Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p1.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[43\]C\. Ryan\-Anderson, J\. G\. Bohnet, K\. Lee, D\. Gresh, A\. Hankin, J\. P\. Gaebler, D\. Francois, A\. Chernoguzov, D\. Lucchetti, N\. C\. Brown,et al\.\(2021\)Realization of real\-time fault\-tolerant quantum error correction\.Physical Review X11\(4\),pp\. 041058\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px1.p1.1)\.
- \[44\]E\. Schluntz\(2025\-01\)Raising the bar on SWE\-bench Verified with Claude 3\.5 Sonnet\.Note:[https://www\.anthropic\.com/engineering/swe\-bench\-sonnet](https://www.anthropic.com/engineering/swe-bench-sonnet)Published Jan 06, 2025\. Reports 49% on SWE\-bench Verified with an agent scaffold\. Accessed 2026\-02\-25Cited by:[§1](https://arxiv.org/html/2605.30358#S1.p3.1)\.
- \[45\]S\. Vishwakarma, F\. Harkins, S\. Golecha, V\. S\. Bajpe, N\. Dupuis, L\. Buratti, D\. Kremer, I\. Faro, R\. Puri, and J\. Cruz\-Benito\(2024\)Qiskit humaneval: an evaluation benchmark for quantum code generative models\.In2024 IEEE International Conference on Quantum Computing and Engineering \(QCE\),Vol\.1,pp\. 1169–1176\.Cited by:[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1)\.
- \[46\]T\. Xie, M\. Lin, M\. Liu, Y\. Ye, C\. Chen, and S\. Liu\(2025\)Infochartqa: a benchmark for multimodal question answering on infographic charts\.arXiv preprint arXiv:2505\.19028\.Cited by:[§3](https://arxiv.org/html/2605.30358#S3.p1.1)\.
- \[47\]R\. Yang, Z\. Wang, Y\. Gu, T\. Chen, Y\. Liang, and T\. Li\(2024\)QCircuitBench: a large\-scale dataset for benchmarking quantum algorithm design\.arXiv preprint arXiv:2410\.07961\.Cited by:[Appendix C](https://arxiv.org/html/2605.30358#A3.p2.1),[Table 1](https://arxiv.org/html/2605.30358#S1.T1.8.8.5),[§1](https://arxiv.org/html/2605.30358#S1.p3.1),[§2](https://arxiv.org/html/2605.30358#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.30358#S4.SS1.SSS0.Px1.p1.1),[Table 4](https://arxiv.org/html/2605.30358#S4.T4)\.
- \[48\]C\. Yu, V\. Uotila, S\. Deng, Q\. Wu, T\. Shi, S\. Jiang, L\. You, and B\. Zhao\(2025\)QUASAR: quantum assembly code generation using tool\-augmented llms via agentic rl\.arXiv preprint arXiv:2510\.00967\.Cited by:[Appendix C](https://arxiv.org/html/2605.30358#A3.p2.1)\.

## Appendix AEthics Statement

The QASM\-Eval dataset is composed of synthetically generated code and natural language prompts, meticulously curated through a hybrid pipeline combining human quantum\-programming experts and artificial intelligence\. Designed as a foundational resource to accelerate the development of reliable LLM assistants for hardware\-facing quantum execution, we emphasize that its construction process is strictly isolated from potential intellectual property and legal concerns\. To safeguard against copyright infringement, the dataset does not copy proprietary algorithms or codebases from existing commercial quantum platforms; rather, all tasks originate from background and core\-task templates designed from scratch by domain experts, which are then systematically abstracted, randomized, and diversified to avoid relying on specific commercial implementations\. We intend for researchers to use this publicly available dataset and its accompanying simulation\-verification framework to rigorously evaluate and train LLMs on OpenQASM 3\. We assume full responsibility for the contents of the dataset and release QASM\-Eval, along with our automated generation pipeline and verification code, under an open\-source license to foster transparent, safe, and legally compliant advancements in quantum software engineering\.

## Appendix BLimitations and Future Work

While QASM\-Eval takes a significant step toward enabling LLMs to generate hardware\-facing quantum programs, our work has several limitations that present opportunities for future research\.

#### Syntax Translation vs\. Algorithmic Reasoning

QASM\-Eval effectively aligns models with OpenQASM 3’s strict syntax and novel features, but the current tasks primarily evaluate constraint\-driven code translation\. Future benchmarks should expand beyond natural\-language\-to\-syntax mapping to assess open\-ended algorithmic reasoning, challenging models to design complex protocols from scratch, such as dynamic decoupling sequences or quantum error correction \(QEC\) schemes\.

#### Synthetic Data vs\. Real\-World Complexity

To ensure strict correctness, scalability, and the absence of data leakage, our dataset relies on an expert\-guided synthetic generation pipeline\. However, this approach may not fully capture the unstructured logic, cross\-file dependencies, and extended context windows typical of "in\-the\-wild" quantum codebases\. As the OpenQASM 3 open\-source ecosystem matures, developing repository\-level evaluations will be a necessary progression\.

## Appendix CLLMs for quantum programming

Large language models \(LLMs\) have demonstrated strong capabilities on code\-generation tasks and have been increasingly adopted in quantum\-computing workflows\. Early efforts largely focus on constructing gate\-level circuits using high\-level Python frameworks\. A representative example is IBM’s Qiskit Code Assistant\[[18](https://arxiv.org/html/2605.30358#bib.bib4)\], which trains LLMs to generate Qiskit programs\. Beyond model training, agentic system has also been applied to enhance LLM’s Qiskit\-generation capability\[[11](https://arxiv.org/html/2605.30358#bib.bib1)\]\. In addition to Qiskit\-centric assistants, The agentic framework Pennylang\[[4](https://arxiv.org/html/2605.30358#bib.bib2)\]explored leveraging retrieval\-augmented generation \(RAG\) in writing Pennylane programs\.

More recently, attention has begun to shift toward hardware\-level representations\. For example, emerging work explores reinforcement\-learning method while training LLMs on OpenQASM programs\[[48](https://arxiv.org/html/2605.30358#bib.bib3)\], as well as attempts to fintune LLMs to generate OpenQASM programs for specific set of quantum algorithms\[[47](https://arxiv.org/html/2605.30358#bib.bib6)\]\. Despite this progress, existing LLM\-for\-quantum\-programming research either anchors at high\-level SDK layer where compilation can hide hardware\-facing constraints, or barely cover the advanced features introduced in OpenQASM 3, such as classical logic, timing constraints, and pulse\-level control\.

## Appendix DDataset Details

This section provides additional details of the dataset, including the structure of the background program and the full set of themes used in the core tasks\.

The background program defines the objects and reference settings required by the downstream core tasks\. These include the drive and measurement frequencies of each qubit, the definitions of pulse interfaces \(ports\), waveform specifications such as Gaussian and DRAG pulses, classical function definitions, the initial frames of waveforms, and randomly instantiated pulse operations and gate sequences\. Based on this background, the pipeline randomly generates core tasks according to predefined templates\.

We further enumerate all themes covered by the core tasks from Table[5](https://arxiv.org/html/2605.30358#A4.T5)to[8](https://arxiv.org/html/2605.30358#A4.T8)\. The corresponding syntax categories and language features have already been summarized in Table[2](https://arxiv.org/html/2605.30358#S3.T2)\. Within each theme, different variants may alter the code structure, statement order, operated objects, parameter values, and logical decision criteria\. However, these variations remain confined to the scope of the given theme, ensuring that each task family consistently targets a specific syntax pattern or language feature\.

Table 5:Classical Logic themesIDTheme1measure\_then\_reset\_branch2repeat\_until\_success3conditional\_correction\_multibit4for\_loop\_continue\_pattern5while\_loop\_break\_on\_measure6angle\_arithmetic\_mod\_wrap7angle\_scaled\_by\_uint8float\_expr\_cast\_to\_angle9integer\_arithmetic\_controls\_flow10mixed\_comparison\_with\_cast11bit\_shift\_and\_mask\_test12popcount\_triggered\_gate13rotl\_rotr\_pattern\_match14bitwise\_and\_xor\_pipeline15integer\_switch\_on\_computed\_index16switch\_on\_int\_from\_bit\_literal17set\_iteration\_discrete18array\_iteration\_angles\_or\_ints19single\_extern\_call20nested\_control\_switch\_in\_loop\_with\_if21measurement\_accumulation\_counter22early\_termination\_end23bit\_slice\_alias\_and\_iteration24membership\_test\_in\_set25switch\_const\_expression\_cases\_no\_defaultTable 6:Timing and scheduling themes used in the dataset\.IDTheme1basic\_delay\_units2dt\_delay3duration\_arithmetic4mixed\_units\_duration5left\_alignment\_stretch6right\_alignment\_stretch7center\_alignment\_stretch8proportional\_placement9durationof\_single\_gate10durationof\_subcircuit11dynamic\_padding\_safe12basic\_box\_boundary13timed\_box\_known\_duration14box\_with\_stretch\_fill15barrier\_ordering16hahn\_echo\_midpoint17center\_align\_halfdiff\_known18multi\_qubit\_delay\_semantics19nop\_sync\_in\_box20duration\_update21explicit\_delay\_as\_structure22delay\_zero\_ordering23nested\_box24multi\_stretch\_solve\_system25durationof\_on\_box\_compoundTable 7:Pulse Control themesIDTheme1x\_gaussian\_play2virtual\_z\_shift\_phase3measure\_capture\_bit4sx\_drag\_play5cr\_composite\_multi\_frame6active\_reset\_equal\_time\_branches7sideband\_modulation\_mix8raw\_samples\_waveform\_literal9frame\_sync\_barrier10raw\_capture\_trace11defcal\_q\_vs\_physical\_qubit12multiplexed\_readout\_multi\_frame13simultaneous\_plays\_parallel\_frames14global\_cal\_scope\_reuse15phase\_tracking\_with\_time\_advance16dd\_sequence\_delay\_play17waveform\_dsp\_add\_scale\_phase\_shift18frequency\_control\_set\_shift19frame\_state\_get\_set\_swap20defcal\_matching\_priority21newframe\_time\_origin\_cal\_vs\_defcal22compile\_time\_determinable\_duration23frame\_collision\_avoidance24multi\_frames\_same\_port\_patterns25measurement\_return\_type\_variantsTable 8:Complex Tasks themesIDTheme1active\_reset\_loop2ramsey\_feedback\_phase\_comp3hahn\_echo\_characterization4t1\_relaxation\_gated\_readout5rabi\_amplitude\_scan6qubit\_spectroscopy\_scan7echoed\_cr\_gate\_verification8dynamic\_cnot\_spectator\_comp9multiplexed\_readout10raw\_waveform\_capture\_and\_filter11measurement\_crosstalk\_calibration12realtime\_feedback\_correction13pipeline\_measure\_reset\_prep14randomized\_benchmarking\_controller15repeat\_until\_success\_prep16leakage\_detection\_and\_recovery17virtual\_z\_phase\_tracking\_test18calibration\_hot\_swap\_local\_override19durationof\_alignment\_scheduling20boxed\_dynamic\_decoupling21switch\_routed\_feedforward22in\_shot\_micro\_averaging23late\_as\_possible\_conditional24timeout\_active\_reset\_with\_update25syndrome\_feedforward\_idle\_scheduling
## Appendix EDataset Examples

In this section, we present a concrete dataset instance to illustrate the structure of a full example\. Each instance consists of a background program, a core task, a natural\-language TODO prompt that describes the core task, and several additional variants derived from the same task family\. The background program is shown in two parts: the declarations, constants, calibration resources, and frame definitions appear in Figure[6](https://arxiv.org/html/2605.30358#A5.F6), while the classical initialization and subsequent gate\-level and timing context appear in Figure[7](https://arxiv.org/html/2605.30358#A5.F7)\. A representative core task from the classical\-logic category, specifically thenested\_control\_switch\_in\_loop\_with\_iftheme, is shown in Figure[8](https://arxiv.org/html/2605.30358#A5.F8)\. Its corresponding TODO prompt is shown in Figure[9](https://arxiv.org/html/2605.30358#A5.F9)\. Three additional variants from the same theme are shown in Figure[10](https://arxiv.org/html/2605.30358#A5.F10)\. In the generated dataset, quantities such as variable names, values, collection contents, operand choices, and code lengths are sampled randomly\.

[⬇](data:text/plain;base64,T3BlblFBU01+MzsKaW5jbHVkZSAic3RkZ2F0ZXMuaW5jIjsKLy8gQ29tcGxleCBiYWNrZ3JvdW5kIDAwMDAwMQovLyBDb250YWluczogdGltaW5nICsgY2xhc3NpY2FsICsgb3BlbnB1bHNlIHJlc291cmNlcwpxdWJpdFszXSBxOwovLyAtLS0tIGNsYXNzaWNhbCByZWdpc3RlcnMgLyBjb3VudGVycyAoYXZhaWxhYmxlIHRvIGNvcmUgdGFza3MpIC0tLS0KYml0WzNdIGJnX207CmJpdFszXSBiZ19tMjsKYml0WzNdIHN5bmRyb21lX2I7CmJvb2wgZmxhZzsKaW50WzMyXSBpOwppbnRbMzJdIHN0ZXA7CmludFszMl0gdHJpZXM7CmludFszMl0gcGM7CnVpbnRbMzJdIG1hc2s7CmZsb2F0WzY0XSBnYWluOwovLyBPbmUgZ2VuZXJpYyBleHRlcm4gaG9vayBmb3IgY29udHJvbGxlci1zaWRlIHVwZGF0ZSBsb2dpYwpleHRlcm4gYWRhcHRpdmVfdXBkYXRlKGludFszMl0sIGludFszMl0sIGZsb2F0WzY0XSkgLT4gZmxvYXRbNjRdOwovLyAtLS0tIHRpbWluZy1mcmllbmRseSBjb25zdGFudHMgKGNvcmUgdGFza3MgbWF5IHJldXNlKSAtLS0tCmNvbnN0IGR1cmF0aW9uIEJHX0lETEUgID0gNTJuczsKY29uc3QgZHVyYXRpb24gUk9fUklORyAgPSAzN3VzOwpjb25zdCBkdXJhdGlvbiBCT1hfV0lOICA9IDI1MWR0OwpkZWZjYWxncmFtbWFyICJvcGVucHVsc2UiOwpjb25zdCBmbG9hdCBkcml2ZV9mcmVxXzAgPSA1LjQwNjI4N2U5Owpjb25zdCBmbG9hdCBtZWFzX2ZyZXFfMCAgPSA2LjEwOTEwNWU5Owpjb25zdCBmbG9hdCBkcml2ZV9mcmVxXzEgPSA0Ljg0ODUwOWU5Owpjb25zdCBmbG9hdCBtZWFzX2ZyZXFfMSAgPSA2LjI4MTQ5N2U5Owpjb25zdCBmbG9hdCBkcml2ZV9mcmVxXzIgPSA2Ljc0NDE1NWU5Owpjb25zdCBmbG9hdCBtZWFzX2ZyZXFfMiAgPSA2LjM2ODY2M2U5OwpjYWwgewogIC8vIC0tLSBkZWNsYXJlIHBvcnRzIC0tLQogIGV4dGVybiBwb3J0IGQwOwogIGV4dGVybiBwb3J0IG0wOwogIGV4dGVybiBwb3J0IGEwOwogIGV4dGVybiBwb3J0IGQxOwogIGV4dGVybiBwb3J0IG0xOwogIGV4dGVybiBwb3J0IGExOwogIGV4dGVybiBwb3J0IGQyOwogIGV4dGVybiBwb3J0IG0yOwogIGV4dGVybiBwb3J0IGEyOwogIC8vIC0tLSBleHRlcm4gd2F2ZWZvcm0gdGVtcGxhdGVzIChjb21tb24pIC0tLQogIGV4dGVybiBnYXVzc2lhbihjb21wbGV4W2Zsb2F0WzMyXV0gYW1wLCBkdXJhdGlvbiBkLCBkdXJhdGlvbiBzaWdtYSkgLT4gd2F2ZWZvcm07CiAgZXh0ZXJuIGRyYWcoY29tcGxleFtmbG9hdFszMl1dIGFtcCwgZHVyYXRpb24gZCwgZHVyYXRpb24gc2lnbWEsIGZsb2F0WzMyXSBiZXRhKSAtPiB3YXZlZm9ybTsKICBleHRlcm4gY29uc3RhbnQoY29tcGxleFtmbG9hdFszMl1dIGFtcCwgZHVyYXRpb24gZCkgLT4gd2F2ZWZvcm07CiAgZXh0ZXJuIHNpbmUoY29tcGxleFtmbG9hdFszMl1dIGFtcCwgZHVyYXRpb24gZCwgZmxvYXRbNjRdIGZyZXF1ZW5jeSwgYW5nbGUgcGhhc2UpIC0+IHdhdmVmb3JtOwogIGV4dGVybiBnYXVzc2lhbl9zcXVhcmUoY29tcGxleFtmbG9hdFszMl1dIGFtcCwgZHVyYXRpb24gZCwgZHVyYXRpb24gc3F1YXJlX3dpZHRoLCBkdXJhdGlvbiBzaWdtYSkgLi4uCiAgICAuLi4gLT4gd2F2ZWZvcm07CiAgLy8gLS0tIGV4dGVybiBjYXB0dXJlL2Rpc2NyaW1pbmF0ZSBob29rcyAoZm9yIG1lYXN1cmUtc3R5bGUgdGFza3MpIC0tLQogIGV4dGVybiBjYXB0dXJlKGZyYW1lIGNhcHR1cmVfZnJhbWUsIHdhdmVmb3JtIGZpbHRlcikgLT4gYml0OwogIGV4dGVybiBjYXB0dXJlX3YxKGZyYW1lIGNhcHR1cmVfZnJhbWUsIGR1cmF0aW9uIGQpIC0+IHdhdmVmb3JtOwogIGV4dGVybiBkaXNjcmltaW5hdGUoY29tcGxleFtmbG9hdFs2NF1dIGlxKSAtPiBiaXQ7CiAgLy8gLS0tIGNyZWF0ZSBmcmFtZXMgLS0tCiAgZnJhbWUgcTBfZHJpdmUgPSBuZXdmcmFtZShkMCwgZHJpdmVfZnJlcV8wLCAwLjApOwogIGZyYW1lIHEwX21lYXMgID0gbmV3ZnJhbWUobTAsICBtZWFzX2ZyZXFfMCwgIDAuMCk7CiAgZnJhbWUgcTBfYWNxICAgPSBuZXdmcmFtZShhMCwgIG1lYXNfZnJlcV8wLCAgMC4wKTsKICBmcmFtZSBxMV9kcml2ZSA9IG5ld2ZyYW1lKGQxLCBkcml2ZV9mcmVxXzEsIDAuMCk7CiAgZnJhbWUgcTFfbWVhcyAgPSBuZXdmcmFtZShtMSwgIG1lYXNfZnJlcV8xLCAgMC4wKTsKICBmcmFtZSBxMV9hY3EgICA9IG5ld2ZyYW1lKGExLCAgbWVhc19mcmVxXzEsICAwLjApOwogIGZyYW1lIHEyX2RyaXZlID0gbmV3ZnJhbWUoZDIsIGRyaXZlX2ZyZXFfMiwgMC4wKTsKICBmcmFtZSBxMl9tZWFzICA9IG5ld2ZyYW1lKG0yLCAgbWVhc19mcmVxXzIsICAwLjApOwogIGZyYW1lIHEyX2FjcSAgID0gbmV3ZnJhbWUoYTIsICBtZWFzX2ZyZXFfMiwgIDAuMCk7CiAgLy8gLS0tIGJhY2tncm91bmQgcHVsc2UgYWN0aW9ucyAodmVyeSBsaWdodHdlaWdodDsgbm8gcGxheSkgLS0tCiAgc2hpZnRfcGhhc2UocTBfZHJpdmUsIDAuMjY2Mzk4KTsKICBzZXRfZnJlcXVlbmN5KHEwX2RyaXZlLCAoZ2V0X2ZyZXF1ZW5jeShxMF9kcml2ZSkgKyAtMjYxMTE0My4wKSk7CiAgZGVsYXlbMThkdF0gcTBfZHJpdmU7CiAgYmFycmllciBxMF9kcml2ZSwgcTJfZHJpdmU7Cn0=)OpenQASM~3;include"stdgates\.inc";//Complexbackground000001//Contains:timing\+classical\+openpulseresourcesqubit\[3\]q;//\-\-\-\-classicalregisters/counters\(availabletocoretasks\)\-\-\-\-bit\[3\]bg\_m;bit\[3\]bg\_m2;bit\[3\]syndrome\_b;boolflag;int\[32\]i;int\[32\]step;int\[32\]tries;int\[32\]pc;uint\[32\]mask;float\[64\]gain;//Onegenericexternhookforcontroller\-sideupdatelogicexternadaptive\_update\(int\[32\],int\[32\],float\[64\]\)\-\>float\[64\];//\-\-\-\-timing\-friendlyconstants\(coretasksmayreuse\)\-\-\-\-constdurationBG\_IDLE=52ns;constdurationRO\_RING=37us;constdurationBOX\_WIN=251dt;defcalgrammar"openpulse";constfloatdrive\_freq\_0=5\.406287e9;constfloatmeas\_freq\_0=6\.109105e9;constfloatdrive\_freq\_1=4\.848509e9;constfloatmeas\_freq\_1=6\.281497e9;constfloatdrive\_freq\_2=6\.744155e9;constfloatmeas\_freq\_2=6\.368663e9;cal\{//\-\-\-declareports\-\-\-externportd0;externportm0;externporta0;externportd1;externportm1;externporta1;externportd2;externportm2;externporta2;//\-\-\-externwaveformtemplates\(common\)\-\-\-externgaussian\(complex\[float\[32\]\]amp,durationd,durationsigma\)\-\>waveform;externdrag\(complex\[float\[32\]\]amp,durationd,durationsigma,float\[32\]beta\)\-\>waveform;externconstant\(complex\[float\[32\]\]amp,durationd\)\-\>waveform;externsine\(complex\[float\[32\]\]amp,durationd,float\[64\]frequency,anglephase\)\-\>waveform;externgaussian\_square\(complex\[float\[32\]\]amp,durationd,durationsquare\_width,durationsigma\)\.\.\.\.\.\.\-\>waveform;//\-\-\-externcapture/discriminatehooks\(formeasure\-styletasks\)\-\-\-externcapture\(framecapture\_frame,waveformfilter\)\-\>bit;externcapture\_v1\(framecapture\_frame,durationd\)\-\>waveform;externdiscriminate\(complex\[float\[64\]\]iq\)\-\>bit;//\-\-\-createframes\-\-\-frameq0\_drive=newframe\(d0,drive\_freq\_0,0\.0\);frameq0\_meas=newframe\(m0,meas\_freq\_0,0\.0\);frameq0\_acq=newframe\(a0,meas\_freq\_0,0\.0\);frameq1\_drive=newframe\(d1,drive\_freq\_1,0\.0\);frameq1\_meas=newframe\(m1,meas\_freq\_1,0\.0\);frameq1\_acq=newframe\(a1,meas\_freq\_1,0\.0\);frameq2\_drive=newframe\(d2,drive\_freq\_2,0\.0\);frameq2\_meas=newframe\(m2,meas\_freq\_2,0\.0\);frameq2\_acq=newframe\(a2,meas\_freq\_2,0\.0\);//\-\-\-backgroundpulseactions\(verylightweight;noplay\)\-\-\-shift\_phase\(q0\_drive,0\.266398\);set\_frequency\(q0\_drive,\(get\_frequency\(q0\_drive\)\+\-2611143\.0\)\);delay\[18dt\]q0\_drive;barrierq0\_drive,q2\_drive;\}Figure 6:Example background program, Part I\. This part defines the qubits, classical variables, timing constants, calibration grammar, external pulse\-level resources, and frame objects used by downstream core tasks\.[⬇](data:text/plain;base64,Ly8gLS0tIGNsYXNzaWNhbCBpbml0IChsaWdodHdlaWdodCkgLS0tCmZsYWcgPSBmYWxzZTsKc3RlcCA9IDA7CnRyaWVzID0gMDsKcGMgPSAwOwptYXNrID0gMDsKZ2FpbiA9IDEuMDsKYmdfbVswXSA9IDA7CmJnX20yWzBdID0gMDsKc3luZHJvbWVfYlswXSA9IDA7CmJnX21bMV0gPSAwOwpiZ19tMlsxXSA9IDA7CnN5bmRyb21lX2JbMV0gPSAwOwpiZ19tWzJdID0gMDsKYmdfbTJbMl0gPSAwOwpzeW5kcm9tZV9iWzJdID0gMDsKCi8vIC0tLSBnYXRlLWxldmVsICsgdGltaW5nIGJhY2tncm91bmQgY2lyY3VpdCAtLS0KcnkoLTAuNzYwMzAyKSBxWzBdOwpyeigxLjE1OTk0OCkgcVsyXTsKeiBxWzBdOwpyeSgtMC45OTkyMzEpIHFbMV07CnJ4KC0wLjIzNzE5NykgcVsyXTsKY3ogcVsxXSwgcVswXTsKdGRnIHFbMF07CgovLyAtLS0gdGltaW5nIGNvbnRleHQ6IGEgZml4ZWQgY29udHJvbCB3aW5kb3cgKGZvciBERCAvIHJlYWRvdXQgcmluZykgLS0tCmJveFtCT1hfV0lOXSB7CiAgZGVsYXlbQkdfSURMRV0gcVswXTsKICAvLyBpbmRlcGVuZGVudCBpZGxlIG9uIGFub3RoZXIgcXViaXQgdG8gZW5jb3VyYWdlIHBhcmFsbGVsIHNjaGVkdWxpbmcKICBkZWxheVtCR19JRExFXSBxWzFdOwp9CgovLyA9PT0gQ09SRV9UQVNLX1NUQVJUID09PQovLyAoY29yZSBjb21wbGV4IHRhc2sgd2lsbCBiZSBpbnNlcnRlZCBoZXJlKQovLyA9PT0gQ09SRV9UQVNLX0VORCA9PT0KCi8vID09PSBNRUFTVVJFTUVOVF9TVEFSVCA9PT0KLy8gKG1lYXN1cmVtZW50IGJsb2NrIHdpbGwgYmUgaW5zZXJ0ZWQgaGVyZSkKLy8gPT09IE1FQVNVUkVNRU5UX0VORCA9PT0=)//\-\-\-classicalinit\(lightweight\)\-\-\-flag=false;step=0;tries=0;pc=0;mask=0;gain=1\.0;bg\_m\[0\]=0;bg\_m2\[0\]=0;syndrome\_b\[0\]=0;bg\_m\[1\]=0;bg\_m2\[1\]=0;syndrome\_b\[1\]=0;bg\_m\[2\]=0;bg\_m2\[2\]=0;syndrome\_b\[2\]=0;//\-\-\-gate\-level\+timingbackgroundcircuit\-\-\-ry\(\-0\.760302\)q\[0\];rz\(1\.159948\)q\[2\];zq\[0\];ry\(\-0\.999231\)q\[1\];rx\(\-0\.237197\)q\[2\];czq\[1\],q\[0\];tdgq\[0\];//\-\-\-timingcontext:afixedcontrolwindow\(forDD/readoutring\)\-\-\-box\[BOX\_WIN\]\{delay\[BG\_IDLE\]q\[0\];//independentidleonanotherqubittoencourageparallelschedulingdelay\[BG\_IDLE\]q\[1\];\}//===CORE\_TASK\_START===//\(corecomplextaskwillbeinsertedhere\)//===CORE\_TASK\_END===//===MEASUREMENT\_START===//\(measurementblockwillbeinsertedhere\)//===MEASUREMENT\_END===Figure 7:Example background program, Part II\. This part initializes the classical state, provides lightweight gate\-level and timing context, and reserves the insertion locations for the core task and the measurement block\. In the full dataset, names, values, counts, and related attributes are generated randomly\.[⬇](data:text/plain;base64,Zm9yIGludCBfX2NjX2kgaW4gezAsMSwyLDMsNCw1fSB7CiAgc3dpdGNoIChfX2NjX2kgJSAzKSB7CiAgICBjYXNlIDAgeyB4IHFbM107IH0KICAgIGNhc2UgMSB7IGlmIChfX2NjX2kgPiAxKSB7IHogcVszXTsgfSBlbHNlIHsgaCBxWzNdOyB9IH0KICAgIGRlZmF1bHQgeyB5IHFbM107IH0KICB9Cn0=)forint\_\_cc\_iin\{0,1,2,3,4,5\}\{switch\(\_\_cc\_i%3\)\{case0\{xq\[3\];\}case1\{if\(\_\_cc\_i\>1\)\{zq\[3\];\}else\{hq\[3\];\}\}default\{yq\[3\];\}\}\}Figure 8:An example core task from the classical\-logic category, instantiated from thenested\_control\_switch\_in\_loop\_with\_iftheme\. As in other dataset instances, the iteration set, operands, and target objects are sampled randomly\.[⬇](data:text/plain;base64,Ly8gVE9ETyhjb3JlIHRhc2spOiBBIGxvb3AgcnVucyBzaXggaXRlcmF0aW9ucyB3aXRoIGFuIGludGVnZXIgaW5kZXggdGFraW5nCi8vIHZhbHVlcyAwIHRocm91Z2ggNSwgYW5kIGluIGVhY2ggaXRlcmF0aW9uIGl0IGFwcGxpZXMgZXhhY3RseSBvbmUKLy8gc2luZ2xlLXF1Yml0IGdhdGUgdG8gcXViaXQgcTMgYmFzZWQgb24gdGhlIHZhbHVlIG9mIHRoZSBpbmRleCBtb2R1bG8gMy4KLy8gV2hlbiB0aGUgaW5kZXggbW9kdWxvIDMgZXF1YWxzIDAsIGl0IGFwcGxpZXMgYW4gWCB0byBxMzsKLy8gV2hlbiB0aGUgaW5kZXggbW9kdWxvIDMgZXF1YWxzIDEsIGl0IGFwcGxpZXMgSCB0byBxMyBmb3IgaW5kaWNlcyAxIG9yIGxlc3MsCi8vIGJ1dCBhcHBsaWVzIFogdG8gcTMgZm9yIGluZGljZXMgZ3JlYXRlciB0aGFuIDEuCi8vIEluIG90aGVyIGNvbmRpdGlvbnMsIGFwcGxpZXMgWSB0byBxMw==)//TODO\(coretask\):Alooprunssixiterationswithanintegerindextaking//values0through5,andineachiterationitappliesexactlyone//single\-qubitgatetoqubitq3basedonthevalueoftheindexmodulo3\.//Whentheindexmodulo3equals0,itappliesanXtoq3;//Whentheindexmodulo3equals1,itappliesHtoq3forindices1orless,//butappliesZtoq3forindicesgreaterthan1\.//Inotherconditions,appliesYtoq3Figure 9:The TODO prompt corresponding to the core task in Figure[8](https://arxiv.org/html/2605.30358#A5.F8)\. This prompt provides a semantic natural\-language description of the missing code block\.[⬇](data:text/plain;base64,Ly8gVmFyaWFudCAxCmZvciBpbnQgX19jY19pIGluIHswLDEsMiwzLDQsNX0gewogIHN3aXRjaCAoX19jY19pKSB7CiAgICBjYXNlIDUgeyBicmVhazsgfQogICAgZGVmYXVsdCB7IHggcVszXTsgfQogIH0KfQoKLy8gVmFyaWFudCAyCmludCBfX2NjX3MgPSAxOwpzd2l0Y2ggKF9fY2NfcykgewogIGNhc2UgMCB7IHggcVswXTsgfQogIGRlZmF1bHQgewogICAgYml0IF9fY2NfbSA9IG1lYXN1cmUgcVswXTsKICAgIGlmIChfX2NjX20pIHsgc3dpdGNoICgxKSB7IGNhc2UgMSB7IHogcVswXTsgfSB9IH0gZWxzZSB7IGggcVswXTsgfQogIH0KfQoKLy8gVmFyaWFudCAzCmZvciBpbnQgX19jY19pIGluIHswLDEsMn0gewogIGJpdCBfX2NjX20gPSBtZWFzdXJlIHFbMF07CiAgaW50IF9fY2NfayA9IGludChfX2NjX20pICsgKF9fY2NfaSAlIDIpOwogIGggcVswXTsKICBzd2l0Y2ggKF9fY2NfaykgeyBjYXNlIDAge3ggcVswXTt9IGNhc2UgMSB7eiBxWzBdO30gZGVmYXVsdCB7aCBxWzBdO30gfQp9)//Variant1forint\_\_cc\_iin\{0,1,2,3,4,5\}\{switch\(\_\_cc\_i\)\{case5\{break;\}default\{xq\[3\];\}\}\}//Variant2int\_\_cc\_s=1;switch\(\_\_cc\_s\)\{case0\{xq\[0\];\}default\{bit\_\_cc\_m=measureq\[0\];if\(\_\_cc\_m\)\{switch\(1\)\{case1\{zq\[0\];\}\}\}else\{hq\[0\];\}\}\}//Variant3forint\_\_cc\_iin\{0,1,2\}\{bit\_\_cc\_m=measureq\[0\];int\_\_cc\_k=int\(\_\_cc\_m\)\+\(\_\_cc\_i%2\);hq\[0\];switch\(\_\_cc\_k\)\{case0\{xq\[0\];\}case1\{zq\[0\];\}default\{hq\[0\];\}\}\}Figure 10:Three additional variants from the same task theme as Figure[8](https://arxiv.org/html/2605.30358#A5.F8)\. These variants preserve the syntactic and semantic scope of the theme while changing the concrete control\-flow structure, operation order, operands, target objects, and parameter choices\. As in the full dataset, these elements are sampled randomly\.
## Appendix FError Examples

In this section, we first present examples in which human experts and the verifier disagree\. We then show several representative failure cases in LLM\-generated code\. Human expert judgments are highly aligned with the verifier overall, but occasional disagreements still arise when a TODO prompt admits more than one plausible interpretation\.

One such example appears in a classical\-logic task from Theme 11\. The TODO prompt is as follows:

//TODO\(coretask\):Thesnippetdefinesabaserotationangleof0\.158116and

//theniteratesoverthediscretesetofintegervalues0,2,3,5,and6\.

//Foreachvalue,itcomputesaZ\-rotationangleequaltothatinteger

//multipliedbythebaseangle,andappliesanRZrotationbythatcomputed

//angletoqubitq\[3\]\.ThisresultsinfivesequentialZ\-axisphaserotations

//onq\[3\],includingazero\-anglerotationwhentheintegeris0\.

The response generated bygpt\-5\.2\-thinkingwas:

rz\(0\.0\)q\[3\];

rz\(0\.316232\)q\[3\];

rz\(0\.474348\)q\[3\];

rz\(0\.79058\)q\[3\];

rz\(0\.948696\)q\[3\];

This answer was motivated by the requirement that the rotations be applied sequentially\. However, the human expert considered the intended solution to require explicit iteration over the given set, and therefore preferred the following form:

angle\_\_cc\_base=0\.158116;

forint\_\_cc\_iin\{0,2,3,5,6\}\{

angle\_\_cc\_th=angle\(float\(\_\_cc\_i\)\*float\(\_\_cc\_base\)\);

rz\(\_\_cc\_th\)q\[3\];

\}

The verifier judged the model output to be correct, since the prompt does not explicitly require a loop\-based implementation\. The human expert, by contrast, applied a stricter interpretation and marked the answer as incorrect\. This example shows that disagreement can arise even when the generated code is semantically consistent with the textual description\.

As for the code given by LLMs, their outputs in QASM\-Eval frequently exhibit syntax errors\. Semantic failures caused by incorrect understanding of task intent may manifest as element, distribution, or timeline errors, but these cases are highly diverse and do not admit representative patterns\. Syntax errors, by contrast, recur in more stable forms\. We list several typical examples below\.

A frequent mistake concerns the syntax of measurement assignment\. The correct form is:

However, LLMs often generate invalid alternatives such as:

measureq\[2\]to\_\_cc\_m;

measureq\[2\]\-\>\_\_cc\_m;

Here,toand\-\>are invalid in OpenQASM 3 measurement assignment syntax\.

Another common error appears inswitch\-casestatements\. The correct form is:

switch\(\_\_cc\_x\)\{

case\_\_cc\_A\+1\{xq\[1\];\}

case\_\_cc\_B\+2\{zq\[1\];\}

\}

A typical incorrect output is:

switch\(\_\_cc\_x\)\{

case\_\_cc\_A\+1:\{xq\[1\];\}

case\_\_cc\_B\+2:\{zq\[1\];\}

\}

In this case, the colon after eachcaselabel is invalid\.

LLMs also sometimes fail to recognize OpenQASM 3 timing and boxed scheduling constructs, and instead produce syntax borrowed from unrelated languages or imagined abstractions\. The correct code is:

box\[229ns\]\{delay\[a\]q\[4\];delay\[b\]q\[4\];\}

box\[155ns\]\{delay\[2\*a\]q\[4\];delay\[b\]q\[4\];\}

A representative incorrect output is:

timerange\[229\]\{delay\(a,q\[4\]\);delay\(b,q\[4\]\);\}

timerange\[155\]\{delay\(2\*a,q\[4\]\);delay\(b,q\[4\]\);\}

Here,timerangeis not a valid OpenQASM 3 construct, and the correspondingdelaysyntax is also incorrect\.

Taken together, these examples indicate that current LLMs still lack sufficiently robust knowledge of OpenQASM 3 syntax and language\-specific features\. Even when the high\-level task intent is understood correctly, the generated output often fails at the level of exact grammar and construct usage\.

## Appendix GPrompts Used in Experiments

This section presents all prompts involved in the LLM\-based pipeline of this work\. They serve three distinct purposes\. The first prompt is used to generate variants of core\-task generators\. The second prompt is used to convert a core task into a natural\-language task description\. The third prompt is used during evaluation, where the model is asked to generate the missing answer block from the task description and the surrounding program context\. For reproducibility, the full prompts are shown in Figures[11](https://arxiv.org/html/2605.30358#A7.F11),[12](https://arxiv.org/html/2605.30358#A7.F12), and[13](https://arxiv.org/html/2605.30358#A7.F13)\.

Prompt for core\-task variant generationInstruction:You are an expert in quantum programming\.Below is reference documentation for OpenQASM 3:\{qasm\_documents\}Given the following core\-task theme\{theme\}and the provided core\-task template\{template\},\\texttt\{\\\{template\\\}\},generate three new core\-task generator variants corresponding to versionsv=2,3,4v=2,3,4\.Each variant must remain within the scope of the same theme and use the same syntax and language features as the template, while varying the code structure, statement ordering, and control\-flow logic\. As in the original template, placeholders should be preserved so that later stages can instantiate them randomly\.For each branch, also provide brief comments and metadata\. The comments should indicate which OpenPulse or OpenQASM features are exercised\. The metadata should specify any required contextual assumptions, such as the availability of ports or frames, timing granularity indt, or other hardware\-related constraints\.Figure 11:Prompt used to generate new variants of core\-task generators from an existing theme and template\.Prompt for task\-description generationSystem prompt:You are given a quantum program snippet\. Write a natural\-language description of what the snippet does\.Requirements:•Mention the involved objects, such as qubits, classical bits, registers, frames, and durations, together with the key parameters, including angles, indices, durations, and constants\.•Do not include code, do not quote any lines, and do not use backticks\.•Do not explain how to write the snippet in any programming language, and do not mention punctuation such as brackets, parentheses, or semicolons\.•Operations may be described at the conceptual level, for example as rotations, entangling operations, delays, or measurements, but the description should remain semantic and high\-level\.•Keep the description concise, within 2–6 sentences\.User prompt:Context \(declarations and earlier operations, for naming only\):\{background\}Core snippet to describe \(this is the portion to be replaced\):\{core\_task\}Following context \(may help disambiguate intent\):\{measurement\_step\}Now write the description\.Figure 12:Prompt used to convert a core task into a natural\-language task description\.Prompt for evaluation\-time completionSystem prompt:You complete missing QASM core blocks\.Output only the QASM statements that belong betweenCORE\_TASK\_STARTandCORE\_TASK\_END\. Do not output the markers themselves\. Do not output explanations\. Do not use backticks\.User prompt:Here is a QASM program with a missing core block\. Fill in the missing core block\.Return only the missing core QASM statements\.\-\-\-\-\- BEGIN PROGRAM \-\-\-\-\-\{prompt\_qasm\}\-\-\-\-\- END PROGRAM \-\-\-\-\-Figure 13:Prompt used during evaluation to complete the missing core QASM block\.
## Appendix HExtra Evaluation Results

We also evaluated several Qwen\-family models on QASM\-Eval\. However, becauseQwen3\-Coder\-480Bwas used during dataset construction to generate task variants and TODO prompts, we do not report Qwen\-related results in the main text in order to avoid potential bias\. For completeness, we provide these results here\. Table[9](https://arxiv.org/html/2605.30358#A8.T9)reports the pass@1 scores ofQwen3\-235B\-A22B\-Instruct\-2507\(qwen235b\),Qwen3\-30B\-A3B\-Instruct\-2507\(qwen30b\), andQwen2\.5\-Coder\-7B\(qwen7b\)\.

Table 9:Evaluation results of Qwen\-family models on QASM\-Eval, measured by pass@1 across task categories\.Modelpass@1ClassicalTimingPulseComplexOverallqwen235b0\.360\.480\.760\.000\.40qwen30b0\.200\.280\.320\.000\.20qwen7b0\.000\.000\.000\.000\.00In addition, the error counts for the Qwen family are shown in Figure[14](https://arxiv.org/html/2605.30358#A8.F14)\.

![Refer to caption](https://arxiv.org/html/2605.30358v1/x7.png)Figure 14:Breakdown of error counts by type for the Qwen family\. The evaluation includes 500 samples in total, and each sample may contain multiple errors\.These results suggest that larger models achieve better overall performance, but their gains remain limited by insufficient mastery of OpenQASM 3 syntax and language\-specific constructs\.

Similar Articles

Evaluating open source LLMs on Autonomous Codenames Simulations

Reddit r/AI_Agents

A developer built a Codenames simulation arena to evaluate open-source LLMs on long-range collaboration, finding DeepSeek v4 Flash outperformed others with high game logic alignment, while Qwen 3 Next and GPT 5.4 Nano struggled with rule constraints and perspective-taking.