Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
Summary
This paper presents STG, a structured testbench generation framework for LLM-driven hardware design workflows that reduces token cost and improves verification reliability compared to existing prompt-based approaches.
# Structured Testbench Generation for LLM\-Driven HDL Design and Verification\-Oriented Data Curation

Source: [https://arxiv.org/html/2606.12983](https://arxiv.org/html/2606.12983)

En\-Ming Huang1, Yu\-Hung Kao1, Ren\-Hao Deng1, Wei\-Po Hsin1, Yao\-Ting Hsieh2, Cheng Liang1, Hsiang\-Yu Tsou1, Mu\-Chi Chen1, Yu\-Kai Hung1, Shao\-Chun Ho1, Po\-Hsuang Huang1, Shih\-Hao Hung1, H\.T\. Kung3

1National Taiwan University, 2Academia Sinica, 3Harvard University

r13922078@csie\.ntu\.edu\.tw, hungsh@csie\.ntu\.edu\.tw, kung@harvard\.edu \(2025\)

###### Abstract\.

Automated testbench generation has become a critical bottleneck in large language model \(LLM\)\-driven Register Transfer Level \(RTL\) workflows, where large numbers of candidate designs must be verified rapidly and reliably\. Existing prompt\-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage\. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches\. As a direct verification tool, STG runs $720\times$ faster than an iterative LLM\-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false\-pass verdicts on incorrect DUTs\. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches\. As a data curation engine, it is $11\times$ faster than LLM\-based filtering on a single CPU core with $127\times$ less energy, and the resulting distilled models provide state\-of\-the\-art performance in our multi\-benchmark evaluation\. As a test\-time scaling oracle, it reduces node count by 14–47%\.
Our models are available at [https://huggingface\.co/collections/AS\-SiliconMind/siliconmind\-v12](https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12)\.

## 1\.Introduction

Functional verification remains one of the most labor\-intensive stages of hardware design\. As Register Transfer Level \(RTL\) designs grow in complexity, constructing testbenches that expose corner cases requires substantial manual effort\. Prior work has explored automated stimulus generation through finite\-state machine \(FSM\) modeling \(Chow1978FSM\), coverage\-guided simulation \(Amla2001BiasedRandom\), and probabilistic methods \(Ferens2003Bayesian\), yet practical testbench development remains a major bottleneck\.

This challenge becomes more acute in the era of large language models \(LLMs\), where hardware description language \(HDL\) code can now be generated at scale from natural language\. Recent LLM\-driven hardware design systems use generated verification artifacts for spec\-to\-RTL evaluation, dataset construction, and test\-time feedback \(Liu2024AutoBench;Liu2025CorrectBench;Liu2025ConfiBench;Yao2025CodeV;Chen2026SiliconMindV1\)\. In such settings, verification is no longer only a downstream design step; it becomes a core mechanism for validating generated HDL artifacts, filtering low\-quality outputs, and organizing data for subsequent model improvement\.

Nevertheless, we observe that existing LLM\-based testbench generation methods \(Liu2024AutoBench;Liu2025CorrectBench;Liu2025ConfiBench;teng2025verirl\) are framed as unconstrained code generation, which leads to two limitations\. First, as testbenches are generated through a stochastic process by LLMs, improving reliability requires iterative prompting or ensemble generation, thereby increasing token cost\.
Second, this formulation overlooks the structured nature of simulation\-based verification: module instantiation, output checking, and reporting can be generated directly, while the core challenge reduces to producing high\-coverage stimuli\.

These limitations are amplified by several emerging demands in LLM\-driven HDL workflows\. Test\-time scaling techniques, such as Monte Carlo Tree Search \(MCTS\)\-based workflow search \(wei2026vflow\) and evolutionary refinement \(novikov2025alphaevolve;min2026revolution\), now place verification inside an iterative optimization loop, where each candidate revision must be evaluated quickly and reliably before the search can proceed; a noisy verification signal directly degrades search efficiency and quality\. Model\-distillation pipelines generate large numbers of candidate DUTs that must be validated before they can serve as training data \(QiMeng2025CodeVR1;teng2025verirl;Chen2026SiliconMindV1\); weak or unstable testbenches misclassify candidate designs, introduce noisy labels, and reduce the value of the curated dataset\. This cost pressure intensifies further as LLM training moves toward continuous learning, where models are iteratively retrained on fresh data \(2025continuelearningsurvey\), and as distilled smaller models find new roles such as speculative\-decoding draft models \(2023specdec\) that accelerate large\-model inference\. All of these settings demand a low\-cost verification mechanism that scales to large numbers of candidates\.

In this work, we present STG, a Structured Testbench Generation framework that combines lightweight HDL analysis with template\-based rendering to produce testbenches deterministically for both combinational and general sequential designs\.
STG is designed to serve as a general\-purpose verification backend for LLM\-driven HDL workflows, supporting three closely related scenarios: \(i\) direct RTL verification, in which candidates are verified against a golden reference; \(ii\) verification\-oriented dataset curation, in which large batches of generated artifacts must be filtered before use as training data; and \(iii\) test\-time scaling, where reliable verification feedback must be provided at every iteration of an LLM\-guided refinement loop\.

We evaluate STG and its applications on Verilog generation benchmarks \(Liu2023VerilogEval;Thakur2024RevisitingVerilogEval\)\. STG generates testbenches $720\times$ faster than an iterative LLM\-based testbench generation pipeline \(Liu2025ConfiBench\), with higher line and toggle coverage and fewer false\-pass verdicts on incorrect DUTs\. For data curation, STG is $10.6\times$ faster on a CPU core with $127\times$ less energy than LLM\-based filtering, and our simple supervised fine\-tuning \(SFT\) pipeline yields competitive or superior results in our multi\-benchmark evaluation \(Liu2023VerilogEval;Thakur2024RevisitingVerilogEval;pinckney2025cvdp;lu2024rtllm\) while using less training data than recent specialized baselines \(teng2025verirl;QiMeng2025CodeVR1;Chen2026SiliconMindV1\)\. In test\-time scaling, STG reduces solved\-problem node count by 14–47% on existing LLMs \(Chen2026SiliconMindV1;openai2025gptoss;Guo2025deepseek\)\. We also identify and correct a systematic race condition in VerilogEval’s testbenches \(Liu2023VerilogEval;Thakur2024RevisitingVerilogEval\) through STG’s deterministic generation and human inspection\. Our results further indicate that the effectiveness of recent complex training and reinforcement learning workflows \(teng2025verirl;QiMeng2025CodeVR1\) remains questionable\.
Our main contributions are threefold: \(1\) We present STG, a deterministic and structure\-aware testbench generation framework for RTL verification that improves over prompt\-based LLM testbench generation in efficiency, coverage, and reliability\. \(2\) We show that STG enables efficient verification\-oriented data curation and supports strong distilled RTL generation models using a simple pipeline\. \(3\) We demonstrate that STG serves as an effective verification backend for LLM\-driven RTL refinement, improving search quality and efficiency across multiple backbone models\. Our models are available at [https://huggingface\.co/collections/AS\-SiliconMind/siliconmind\-v12](https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12)\.

The remainder of this paper is organized as follows\. Section [2](https://arxiv.org/html/2606.12983#S2) reviews related work on LLM\-based testbench generation and verification\-oriented RTL workflows\. Section [3](https://arxiv.org/html/2606.12983#S3) introduces the STG framework\. Section [4](https://arxiv.org/html/2606.12983#S4) describes the applications of STG\. Section [5](https://arxiv.org/html/2606.12983#S5) presents experimental results, and Section [6](https://arxiv.org/html/2606.12983#S6) concludes\.

## 2\.Problem Definition and Background

This section formalizes the verification problem addressed by STG\. We define the known\-reference RTL verification setting and its requirements, review existing LLM\-based testbench\-generation workflows, and discuss how test\-time scaling and data curation create additional demands on testbench quality\.

### 2\.1\.Known\-Reference RTL Verification

We consider the known\-reference RTL verification setting\. Given a design under test \(DUT\) $D$ and a trusted golden implementation $G$, the objective is to automatically construct a testbench $T$ that applies effective stimuli to $D$, compares its behavior against $G$, and determines whether $D$ is functionally correct\.
The generated testbench must satisfy three practical requirements: \(i\) produce trustworthy pass/fail judgments, \(ii\) achieve high behavioral coverage, particularly for control\- and state\-dependent behaviors, and \(iii\) incur low generation cost so that it scales to large batches of RTL candidates\.

This setting is broadly applicable in current LLM\-driven RTL workflows, where golden references are routinely available: RTL generation benchmarks ship with reference implementations \(Liu2023VerilogEval;Thakur2024RevisitingVerilogEval;lu2024rtllm;pinckney2025cvdp\), test\-time scaling systems generate candidates against a known specification \(wei2026vflow;min2026revolution\), and data\-curation pipelines filter LLM outputs against trusted solutions \(Yao2025CodeV;QiMeng2025CodeVR1;Chen2026SiliconMindV1\)\. The known\-reference assumption therefore covers the three use cases introduced in Section [1](https://arxiv.org/html/2606.12983#S1): *direct RTL verification*, *verification\-oriented data curation*, and *test\-time scaling*\.

We therefore formulate the target problem as “structured testbench generation for verification\-oriented classification”: given $D$ and $G$, generate a testbench that reveals meaningful behaviors of $D$, produces a trustworthy pass/fail decision, and scales to large numbers of generated RTL candidates\.

### 2\.2\.LLM\-Based Testbench\-Generation Methods

A line of prior work \(AutoBench \(Liu2024AutoBench\), CorrectBench \(Liu2025CorrectBench\), and ConfiBench \(Liu2025ConfiBench\)\) tackles the open\-ended setting where no trusted reference exists, and the LLM must synthesize both stimulus and a *silver reference* oracle from scratch\. A *silver reference* is an alternative implementation of the same specification produced by an LLM \(e\.g\., a behavioral model in C\+\+ or Python\), used as a substitute oracle when no authoritative golden reference is available\.
While these methods progressively improve generation quality through self\-correction and ensembling, they share a fundamental ambiguity: when the DUT and the oracle are both produced by the same stochastic process, a mismatch cannot be unambiguously attributed to a bug in the DUT versus an error in the oracle, making the pass/fail verdict inherently unreliable\. The known\-reference setting assumed in this work eliminates this ambiguity by assuming that a trusted golden reference $G$ is in hand, so any discrepancy is definitively a DUT fault\. This shifts the problem from open\-ended code synthesis to efficient, high\-coverage stimulus generation\.

### 2\.3\.Test\-Time Scaling and Verification\-Oriented Data Curation

The need for efficient known\-reference verification is amplified by two recent trends in LLM\-driven RTL generation that both rely heavily on verification quality: test\-time scaling and verification\-oriented data curation\.

Test\-time scaling\. Recent LLM\-based RTL generation systems have moved beyond one\-shot prompting toward iterative search and refinement at inference time, placing verification inside the optimization loop rather than after it \(wei2026vflow;min2026revolution;Dong2025ScaleRTL\)\. The architectural patterns vary, but all share a common requirement: at every iteration, the system must evaluate each candidate and use the result to decide what to generate next\. This turns the testbench into a performance\-critical component of the generation process itself\. A noisy or unreliable verification signal can cause the search to retain faulty candidates, reject correct ones, or waste iterations on ambiguous feedback\. The testbench must therefore be not only correct but also fast to generate, deterministic, and informative enough to distinguish partially correct designs from wholly incorrect ones\.
Verification\-oriented data curation\. A parallel development is the growing use of model distillation and reinforcement learning to train small thinking models that are specialized for RTL generation \(teng2025verirl;Yao2025CodeV;QiMeng2025CodeVR1;Chen2026SiliconMindV1\)\. These pipelines produce large numbers of candidate DUTs, often paired with reasoning traces or auxiliary artifacts, that must be validated and classified before they can serve as training data\. The verification artifacts in the filtering stage, however, are still commonly handled through prompt\-based LLMs, which are expensive to scale when screening large datasets\. Weak or unstable testbenches at this stage can also misclassify candidate designs, introduce noisy labels, and degrade the quality of the curated dataset \(QiMeng2025CodeVR1;Chen2026SiliconMindV1\)\. Currently, no mechanism exists that can filter large numbers of candidates cheaply and reproducibly without requiring per\-task LLM invocation \(QiMeng2025CodeVR1;teng2025verirl;Chen2026SiliconMindV1\)\.

Both trends redefine the role of verification in LLM\-driven RTL pipelines\. Verification no longer serves solely to judge whether a generated DUT is correct; it also provides the feedback signal inside search loops and the quality gate for training\-data construction\. The verification engine thus becomes part of the core infrastructure for model improvement, making low\-cost, reproducible, and behaviorally meaningful testbench generation especially valuable\.

## 3\.STG: Structured Testbench Generation

Figure 1\. Overall workflow of STG\. STG is mainly designed for the setting in which both the DUT and a golden reference are available, but it can also be extended to the silver\-reference setting discussed in §[2\.2](https://arxiv.org/html/2606.12983#S2.SS2)\. The red lines correspond to the main workflow of STG; the blue lines indicate the extension to the silver\-reference setting\.
The black lines are common steps for both settings\.

Fig\. [1](https://arxiv.org/html/2606.12983#S3.F1) shows the overall flow of STG\. The framework takes both the DUT and golden reference as input\. STG first analyzes the HDL structure, then generates a testbench deterministically according to the detected design type, and finally compiles and executes the testbench to obtain pass/fail statistics and coverage information\. The testbench is rendered from parameterized Jinja templates populated with the extracted module information, including port lists, signal roles, and design\-type\-specific parameters\.

In this section, we first describe the module parsing and analysis process, which extracts the necessary information from the HDL code to guide the generation\. We then present the different stimulus\-generation strategies for combinational, general sequential, and FSM\-dominated designs\. Finally, we discuss how STG can be extended to the setting where no trusted golden reference is available and an LLM\-generated silver reference is used instead, as in the workflows discussed in Section [2\.2](https://arxiv.org/html/2606.12983#S2.SS2)\.

### 3\.1\.Module Parsing and Analysis

STG operates in two modes\. In *automatic mode*, the framework analyzes the DUT entirely through heuristics and lightweight LLM queries, requiring no human intervention\. This mode is designed for large\-scale data curation where thousands of modules must be processed without manual effort\. In *interactive mode*, a user may supply additional hints \(such as explicit signal roles or design\-type overrides\) to improve accuracy for a specific verification task\.

Top\-module identification\. STG parses all module instantiations in the input files using Icarus Verilog \(williams2002icarus\) to construct a module\-instantiation directed acyclic graph\. The top module is identified as the root node of this graph\.
When multiple roots exist \(e\.g\., if utility modules are also provided\), STG selects the root with the most descendant nodes as the top module\.

Table 1\. Design\-type classification and corresponding testbench generation strategy\.

Design\-type classification\. STG classifies each design into one of three categories, *combinational*, *general sequential*, or *FSM\-dominated*, since each category requires a different strategy for stimulus generation \(detailed in Sections [3\.2\.1](https://arxiv.org/html/2606.12983#S3.SS2.SSS1)–[3\.2\.3](https://arxiv.org/html/2606.12983#S3.SS2.SSS3)\)\. The classification proceeds as follows\. First, STG checks whether any clock signal is present in the port list\. If no clock is detected, the design is classified as combinational\. Otherwise, STG performs FSM detection to distinguish FSM\-dominated designs from general sequential circuits\.

FSM detection uses two complementary methods: \(1\) Deterministic pattern matching\. STG scans the HDL source for `always_ff` \(or `always @(posedge clk)`\) blocks that contain `case`/`casez` statements indexed by a register whose name matches common state\-variable patterns \(e\.g\., `state`\)\. If such a pattern is found, the design is classified as FSM\-dominated\. \(2\) LLM\-assisted analysis\. When deterministic matching is inconclusive, STG issues a structured prompt to an LLM, providing the module source and requesting a JSON description of the FSM, including state variables, encodings, static parameters, and transitions with their associated conditions\. This step extracts the FSM structure needed for targeted stimulus generation \(§[3\.2\.3](https://arxiv.org/html/2606.12983#S3.SS2.SSS3)\)\. If neither method identifies an FSM, the design is classified as general sequential\.

Signal classification\. Table 2\. Signal classification heuristics\.
All roles use LCS\-based fuzzy matching against name hints \(case\-insensitive\)\.

After determining the design type, STG classifies each input port into one of four roles: *clock*, *reset*, *control*, or *data* \(output ports are handled uniformly by the checking logic\)\. Table [2](https://arxiv.org/html/2606.12983#S3.T2) summarizes the heuristics\. All four categories use longest common subsequence \(LCS\)\-based fuzzy matching: each port name is compared against category\-specific hint lists using an LCS similarity score\. Clock and reset signals are identified first, and the remaining input signals are classified as control or data using the same LCS\-based matching against their respective hint lists \(Table [2](https://arxiv.org/html/2606.12983#S3.T2)\), combined with a width\-based heuristic: narrow signals receive a higher control score, while wide signals receive a higher data score\. When scores are tied, the signal defaults to control\. In interactive mode, users may override any classification by providing explicit signal\-role mappings\.

### 3\.2\.Testbench Generation Strategies

All three strategies share a common template\-based architecture: STG renders testbenches from parameterized Jinja templates, filling in module names, port lists, signal roles, and strategy\-specific parameters\. Based on the classified design type, STG selects the appropriate stimulus strategy \(exhaustive\-control enumeration for combinational designs, two\-pass clocked stimulus for general sequential designs, or FSM traversal for FSM\-dominated designs\) and populates the corresponding template\. Each generated testbench instantiates both the DUT and the golden reference with shared input drivers and separate output wires, and invokes a unified comparison task after every stimulus event\. STG also maintains per\-output error counters, which are accumulated throughout the simulation and reported as pass rates at the end\.
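
The LCS\-based signal classification of Section 3\.1 can be sketched in a few lines of Python\. This is only an illustrative sketch, not STG's implementation: the hint lists, the similarity normalization, and the constants `ROLE_THRESHOLD`, `NARROW_WIDTH`, and `WIDTH_BONUS` are all assumptions, since the contents of Table 2 are not reproduced in the text\.

```python
# Hypothetical hint lists and constants; the paper's Table 2 values are not known here.
HINTS = {
    "clock": ["clk", "clock"],
    "reset": ["rst", "rst_n", "reset", "areset"],
    "control": ["en", "enable", "sel", "valid", "start", "mode"],
    "data": ["data", "din", "addr", "in"],
}
ROLE_THRESHOLD = 0.75  # assumed minimum similarity to accept clock/reset
NARROW_WIDTH = 4       # assumed boundary between "narrow" and "wide" ports
WIDTH_BONUS = 0.25     # assumed size of the width-based score adjustment


def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(name: str, hint: str) -> float:
    """Case-insensitive LCS length, normalized by the longer of the two strings."""
    name, hint = name.lower(), hint.lower()
    return lcs_len(name, hint) / max(len(name), len(hint))


def best_score(name: str, role: str) -> float:
    """Best similarity of a port name against one role's hint list."""
    return max(similarity(name, hint) for hint in HINTS[role])


def classify(name: str, width: int) -> str:
    """Assign one of the four roles to an input port."""
    # Clock and reset are identified first.
    for role in ("clock", "reset"):
        if best_score(name, role) >= ROLE_THRESHOLD:
            return role
    # Remaining ports: name similarity plus a width-based adjustment.
    control = best_score(name, "control")
    data = best_score(name, "data")
    if width <= NARROW_WIDTH:
        control += WIDTH_BONUS
    else:
        data += WIDTH_BONUS
    # Ties default to control, as stated in the text.
    return "control" if control >= data else "data"
```

Under these assumed hints, `classify("sel", 2)` yields `control` and `classify("data_in", 8)` yields `data`; in interactive mode, an explicit user\-supplied mapping would simply bypass `classify`\.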
#### 3\.2\.1\.Combinational

For combinational designs, the testbench applies stimulus directly without a clock\. Following the signal partition in Table [2](https://arxiv.org/html/2606.12983#S3.T2), control signals are enumerated exhaustively over all $2^{b_c}$ combinations \(where $b_c$ is the total control\-input width\), and for each control vector, data signals are randomized independently over $N_s$ samples\. The total number of test vectors is therefore $2^{b_c} \times N_s$\. After each stimulus application, the testbench invokes the comparison task to check all outputs\. To keep simulation cost bounded, STG enforces a configurable upper bound $2^{b_{\max}}$ on the total vector count, requiring $2^{b_c} \times N_s \leq 2^{b_{\max}}$; $N_s$ is automatically reduced when this bound would otherwise be exceeded\.

This strategy ensures that every reachable control mode is tested, while data\-path behavior within each mode is sampled with high probability\. For designs whose total input width is small enough, treating all inputs as control signals effectively yields exhaustive verification\.

#### 3\.2\.2\.General Sequential

Sequential designs require clock\-driven simulation and careful handling of resets and output timing\. A key issue is that a sequential design may use synchronous or asynchronous resets, and its outputs may follow Moore semantics \(changing only on clock edges\) or Mealy semantics \(changing combinationally in response to inputs within a cycle\)\. A testbench that only checks outputs at the positive clock edge may miss Mealy\-style output changes, while one that ignores asynchronous reset behavior may miss recovery bugs\.

This issue is not well handled by AutoBench and its follow\-ups \(Liu2024AutoBench;Liu2025CorrectBench;Liu2025ConfiBench\), where the LLM is asked to generate clock\-based input stimulus and a Python\-based checker that consumes the DUT outputs cycle by cycle\.
That structure naturally assumes observation only at clock boundaries: the generated checker receives one output snapshot per cycle, rather than intermediate within\-cycle responses\. As a result, Mealy\-style behaviors that depend on intra\-cycle input changes are easily overlooked even if the clocked trace appears correct\.

Figure 2\. Timing structure of the general sequential strategy\. The signal labeled “mealy” corresponds to a latch\-like within\-cycle response, while the signal labeled “moore” corresponds to an FF\-based registered response\. STG inserts comparison points after intra\-cycle input changes as well as at negative and positive edges, so both behaviors are observed\.

As illustrated in Fig\. [2](https://arxiv.org/html/2606.12983#S3.F2), STG addresses these issues through a two\-pass, reset\-aware strategy with multi\-phase comparison points\. Inputs are driven inside the clock period rather than only at the period boundary, and outputs are checked after intra\-cycle input changes and at the clock edges\. This allows the testbench to observe both within\-cycle reactions and edge\-triggered updates, so Mealy\-style behavior is not missed while Moore\-style registered behavior is still verified\. The same framework also handles reset recovery: in the first pass, no resets are injected, allowing the design to accumulate state under sustained stimulus, whereas in the second pass resets are probabilistically inserted between stimulus cycles\. The reset task adapts to the reset type, asserting and releasing the signal at clock boundaries for synchronous resets and exercising short assert–deassert sequences for asynchronous resets\.

The stimulus generation in the sequential strategy follows a two\-level randomization structure rather than a single “enumerate control, then randomize data” loop\.
Following Table [2](https://arxiv.org/html/2606.12983#S3.T2), STG first performs outer\-loop random data injection before control enumeration, allowing the design to accumulate state under unconstrained data activity and exposing behaviors that are sensitive to prior history\. It then enumerates control inputs exhaustively over all $2^{b_c}$ combinations\. For each control vector, STG performs an inner loop of data randomization, repeatedly sampling data inputs while holding the control setting fixed\. As a result, the stimulus schedule can be viewed as *random data* $\rightarrow$ *control assignment* $\rightarrow$ *random data*, rather than a single flat sampling loop\.

#### 3\.2\.3\.FSM\-Guided

For FSM\-dominated designs, random stimulus is unlikely to reach deep states or exercise rare transitions within a practical number of cycles\. STG uses the extracted FSM structure \(§[3\.1](https://arxiv.org/html/2606.12983#S3.SS1)\) to guide stimulus generation toward full transition coverage\.

Figure 3\. Example of FSM\-guided traversal\. STG separates directly drivable input signals from internal wait conditions\.

As shown in Fig\. [3](https://arxiv.org/html/2606.12983#S3.F3), the FSM\-guided strategy operates in two stages\. In the *generation stage*, STG extracts a state\-transition graph from the DUT via deterministic pattern matching or LLM\-assisted analysis \(§[3\.1](https://arxiv.org/html/2606.12983#S3.SS1)\), and generates a C\+\+ testbench that encodes the graph\. Each edge guard is split into an *input condition* \(predicates over drivable ports\) and a *wait condition* \(predicates over internal runtime state\)\. For example, `ack==0 && cnt>=3` becomes: drive `ack=0`, and wait until `cnt>=3`\. This separation allows STG to drive controllable inputs deterministically while internal conditions are satisfied naturally\.
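
The guard split described above can be illustrated with a minimal Python sketch\. It is not STG's generator \(which emits C\+\+\); it assumes guards are pure conjunctions of `signal op constant` predicates joined by `&&`, and the names `split_guard` and `assignments` are hypothetical\.

```python
import re

# One predicate of the assumed form: <signal> <op> <decimal constant>
PREDICATE = re.compile(r"^\s*(\w+)\s*(==|!=|>=|<=|>|<)\s*(\d+)\s*$")


def split_guard(guard: str, input_ports: set[str]):
    """Split a conjunctive edge guard into (input_condition, wait_condition).

    Predicates over drivable input ports go to the input condition;
    predicates over internal state go to the wait condition.
    """
    input_cond, wait_cond = [], []
    for term in guard.split("&&"):
        match = PREDICATE.match(term)
        if match is None:
            raise ValueError(f"unsupported predicate: {term!r}")
        signal, op, value = match.group(1), match.group(2), int(match.group(3))
        target = input_cond if signal in input_ports else wait_cond
        target.append((signal, op, value))
    return input_cond, wait_cond


def assignments(input_cond):
    """Concrete values to drive for equality predicates, e.g. ack==0 -> ack=0."""
    return {signal: value for signal, op, value in input_cond if op == "=="}


# The example from the text: `ack` is a drivable port, `cnt` is internal state.
inp, wait = split_guard("ack==0 && cnt>=3", {"ack"})
print(assignments(inp))  # values to drive on the input ports
print(wait)              # conditions to wait for while clocking the design
```

Restricting the sketch to conjunctions keeps the split trivial; guards containing disjunctions or negations would need a real expression parser, which is outside what this example attempts\.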
The C\+\+ testbench is then compiled together with the Verilog DUT and golden reference into a single executable via Verilator, which translates Verilog modules into C\+\+ classes and thereby enables high\-level constructs such as recursive traversal\.

In the *simulation stage*, the harness traverses the graph by DFS\. At each state, STG parses the input condition into a lightweight AST and performs deterministic constraint extraction to derive concrete signal assignments \(e\.g\., resolving `ack==0` to `ack=0`\)\. It then drives those assignments and advances the clock until the wait condition is satisfied or a timeout is reached\. If a transition is infeasible or times out, STG resets and backtracks to explore an alternative path, systematically covering all reachable transitions without random exploration\.

To verify that each transition is genuinely exercised at the HDL level, we extend Verilator’s coverage API to expose per\-line execution counts at runtime, providing a fine\-grained signal for whether the HDL statements associated with the target transition have actually been reached\. Finally, the testbench reports both pass rates and transition\-coverage statistics at the end\.

### 3\.3\.Extension to the Silver\-Reference Setting

This paper focuses on the known\-reference setting, where a trusted golden implementation is available\. Nevertheless, STG’s structural\-analysis and stimulus\-generation pipeline is not inherently tied to this assumption: the golden HDL module can be replaced by a software reference model \(the *silver reference* mentioned in Section [2\.2](https://arxiv.org/html/2606.12983#S2.SS2)\), typically emitted as a C\+\+ or SystemC header\.

Figure 4\. Simplified structure of the silver\-reference template\. STG generates a C\+\+ interface with DUT\-aligned inputs and outputs, while the LLM fills in the behavioral logic\.

Fig\. [4](https://arxiv.org/html/2606.12983#S3.F4) illustrates this extension\.
STG generates a skeleton software\-model interface whose fields mirror the DUT ports and whose hooks align with the testbench’s event structure\. The LLM only needs to fill in the behavioral logic inside this fixed interface; STG preserves the same stimulus schedule and comparison flow used in the golden\-reference setting\. The C\+\+ code is then compiled with the Verilog files via Verilator, which converts Verilog modules into C\+\+ classes\. Because verification quality now depends on the LLM\-generated reference rather than a trusted golden implementation, this mode trades oracle reliability for broader applicability\. We include it here to show that STG’s architecture generalizes beyond the known\-reference setting evaluated in this work\.

## 4\.Applications of STG

The STG framework described in Section [3](https://arxiv.org/html/2606.12983#S3) is not limited to a single benchmark format\. More generally, it provides a structured verification backend for LLM\-driven RTL workflows whenever the main bottleneck is reliable stimulus generation and low\-cost behavioral checking\. We highlight three representative applications\.

Replacing ad hoc benchmark testbenches\. Benchmark suites such as VerilogEval \(Liu2023VerilogEval;Thakur2024RevisitingVerilogEval\) and CVDP \(pinckney2025cvdp\) typically rely on hand\-written verification artifacts\. STG can be used directly by benchmark designers as a testbench\-construction interface: given the DUT and reference, it generates a working testbench shell with the appropriate structure for combinational, sequential, or FSM\-dominated designs\. This is useful even in interactive mode, where a human can provide signal\-role hints or design\-type overrides and then build on top of the generated scaffold\. In practice, this means benchmark authors do not need to write every testbench from scratch\.
STG can quickly provide the module instantiation, clock/reset handling, and default stimulus structure, after which a human can add extra corner\-case patterns or benchmark\-specific checks if needed\. This reduces manual effort while keeping the final benchmark testbench extensible rather than fully opaque or LLM\-generated end\-to\-end\.

Verification\-oriented data curation\. As outlined in Section [2\.3](https://arxiv.org/html/2606.12983#S2.SS3), model\-distillation pipelines generate large numbers of candidate DUTs that must be filtered before they can serve as training data \(Yao2025CodeV;QiMeng2025CodeVR1;teng2025verirl;Chen2026SiliconMindV1\)\. Filtering is still commonly handled through prompt\-based LLMs or LLM\-generated verification artifacts, which are expensive and difficult to scale for large datasets such as PyraNet \(nadimi2025pyranet\)\.

Figure 5\. Verification\-oriented data\-curation and training flow\. Step \(1\) filters the source dataset to retain hard problems; Step \(2\) generates candidate DUTs with a teacher model and verifies them with STG; Step \(3\) trains the student model on the curated data\.

Fig\. [5](https://arxiv.org/html/2606.12983#S4.F5) shows our simple three\-step data\-curation and SFT workflow with STG\. We start from a pool of 692k PyraNet samples and, because of limited computational resources, first down\-select about 115k candidates using the problem\-difficulty and code\-quality indicators provided by the source dataset\. In Step \(1\), STG is used to identify hard problems that are not already solved by the small base models, and samples correctly solved by the base models are removed\. In Step \(2\), a teacher model generates a reasoning trace and Verilog answer for each remaining problem, and STG again uses the golden reference to verify whether the teacher\-produced DUT is correct\. After this verification\-based curation stage, 43k samples remain\. In Step \(3\), the surviving samples are used to train the student model\.
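
The three\-step flow can be summarized by the following Python sketch\. It is a schematic under stated assumptions rather than the authors' pipeline: `verify` stands in for an STG run against the golden reference, `base_models` and `teacher` are plain callables, the problem record layout is hypothetical, and a problem is treated as already solved if any base model passes \(the text does not specify any\-versus\-all\)\.

```python
from typing import Callable, Iterable


def curate(
    problems: Iterable[dict],                      # assumed shape: {"spec": str, "golden": str}
    base_models: list[Callable[[str], str]],       # spec -> candidate Verilog
    teacher: Callable[[str], tuple[str, str]],     # spec -> (reasoning trace, Verilog answer)
    verify: Callable[[str, str], bool],            # (dut, golden) -> pass/fail, stand-in for STG
) -> list[dict]:
    """Return the curated training samples for Step (3)."""
    curated = []
    for problem in problems:
        spec, golden = problem["spec"], problem["golden"]
        # Step (1): difficulty filter. Drop problems a small base model already solves
        # (assumption: one passing base model is enough to drop the problem).
        if any(verify(model(spec), golden) for model in base_models):
            continue
        # Step (2): the teacher answers; keep the sample only if the answer verifies.
        trace, answer = teacher(spec)
        if verify(answer, golden):
            curated.append({"spec": spec, "trace": trace, "answer": answer})
    # Step (3), fine-tuning the student on `curated`, happens outside this function.
    return curated
```

Because `verify` is called once per base\-model attempt and once per teacher answer, its cost dominates the loop, which is why a cheap deterministic verifier matters at this scale\.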
STG plays two distinct roles in this workflow. First, it acts as a difficulty filter by measuring which problems remain unsolved by small base models, allowing us to focus the curation budget on informative training targets. Second, it serves as the verifier for teacher-generated answers, retaining only correct solutions. This makes STG a practical screening engine for large-scale RTL data curation before SFT, bypassing the need for per-problem LLM invocation required by recent specialized RTL models (Chen2026SiliconMindV1; teng2025verirl; QiMeng2025CodeVR1).

Verification backend for test-time scaling. As discussed in Section [2.3](https://arxiv.org/html/2606.12983#S2.SS3), recent RTL generation systems increasingly use iterative search and refinement at inference time (wei2026vflow; min2026revolution; Dong2025ScaleRTL). In these systems, verification is no longer a one-shot final check but part of the optimization loop: the quality of each search iteration depends directly on the quality of the verification signal.

Figure 6. Modified MCTS-based refinement flow with STG as the verification backend.

Fig. [6](https://arxiv.org/html/2606.12983#S4.F6) shows our modified MCTS-style refinement loop based on VFlow (wei2026vflow). Starting from a selected leaf node, the LLM proposes a modified RTL candidate, which is then verified by STG through testbench generation and RTL simulation. The reported score is propagated back along the search path and used to guide subsequent node selection. In this flow, STG serves as a drop-in replacement for the benchmark-provided testbench. Compared with a fixed benchmark testbench, STG explores a wider set of scenarios and provides more concrete feedback about candidate behavior. This gives the search loop a stronger signal, allowing it to reject weak candidates earlier, guide refinement more effectively, and reach correct designs with fewer iterations and lower token cost.
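
To make the score-propagation step concrete, the following sketch shows one way a verification score could be backed up along the search path and then used for node selection. It is an illustration under stated assumptions rather than the VFlow or STG implementation: the `Node` type, the use of the mean of per-output-port pass rates as the scalar score, and the UCT selection rule with exploration constant `c = 1.4` are all choices made here for the example.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    rtl: str                                  # candidate RTL held by this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0


def score_from_port_pass_rates(rates: Dict[str, float]) -> float:
    """Collapse per-port pass rates into one scalar (mean; an assumption)."""
    return sum(rates.values()) / len(rates) if rates else 0.0


def backpropagate(node: Optional[Node], score: float) -> None:
    """Add the score to every node on the path from `node` up to the root."""
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent


def uct(node: Node, c: float = 1.4) -> float:
    """Standard UCT value; unvisited nodes are explored first."""
    if node.visits == 0:
        return math.inf
    parent_visits = node.parent.visits if node.parent else node.visits
    exploit = node.total_score / node.visits
    explore = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore


def select_leaf(root: Node) -> Node:
    """Descend from the root, always taking the child with the highest UCT."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node


# Toy run: two refinements of the root, scored by hypothetical port pass rates.
root = Node("root candidate")
a = Node("candidate A", parent=root)
b = Node("candidate B", parent=root)
root.children = [a, b]

backpropagate(a, score_from_port_pass_rates({"q": 0.5, "valid": 1.0}))
backpropagate(b, score_from_port_pass_rates({"q": 1.0, "valid": 1.0}))

best = select_leaf(root)
print(best.rtl)
```

A graded score of this kind lets the search distinguish a candidate that fails on one output port from one that fails on all of them, which a binary pass/fail verdict cannot do.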
Figure 7. Race condition in VerilogEval testbenches and its fix. We manually insert #1 (highlighted) after the clock edge.

## 5. Experimental Results and Evaluation

In this section, STG is evaluated across the three application scenarios described in Section [4](https://arxiv.org/html/2606.12983#S4): (1) Testbench quality and DUT classification (§[5.2](https://arxiv.org/html/2606.12983#S5.SS2)): STG is benchmarked against a ConfiBench-style (Liu2025ConfiBench) iterative LLM testbench generation pipeline on VerilogEval, followed by a coverage analysis contrasting STG’s sequential-random and FSM-guided strategies on a deep-state FSM design. (2) Verification-oriented data curation (§[5.3](https://arxiv.org/html/2606.12983#S5.SS3)): STG serves as the verification engine for large-scale training-data filtering, and the resulting distilled models are evaluated against state-of-the-art specialized small language models. (3) Test-time scaling (§[5.4](https://arxiv.org/html/2606.12983#S5.SS4)): STG replaces the benchmark-provided testbench as the verification backend in an MCTS-based code refinement loop, and STG’s search efficiency is measured across four backbone language models.

### 5.1. Experimental Setup

All LLM inference experiments use GPT-OSS-120B (openai2025gptoss) running on one NVIDIA GB200 GPU. STG’s deterministic pipeline (parsing, signal classification, and template rendering) runs on a single CPU core (Intel Xeon w9-3475X, max 4.8 GHz); for FSM-dominated designs, STG additionally invokes GPT-OSS-120B to extract the state-transition graph. For model distillation, training data is sourced from PyraNet (nadimi2025pyranet), a large-scale dataset for RTL generation training; we use GPT-OSS-120B as the teacher and fine-tune three student models (Qwen2.5-Coder-7B-Instruct, Qwen3-4B-Thinking, and Qwen3-8B) (hui2024qwen25codertechnicalreport; yang2025qwen3technicalreport) on 16 NVIDIA H100 GPUs.
Our training recipe is intentionally simple: after STG-based data curation, each student is trained with only an SFT stage. We compare against recent specialized small LMs that use more complex fine-tuning pipelines, including multi-stage SFT (SiliconMind-V1 (Chen2026SiliconMindV1)) and combined SFT and RL methods (CodeV-R1 (QiMeng2025CodeVR1) and VeriRL (teng2025verirl)). Importantly, this comparison is not driven by giving STG newer training data than previous works. For test-time scaling, we use four backbone LLMs spanning a wide range of model sizes and training recipes: SiliconMind-V1-7B (Chen2026SiliconMindV1), GPT-OSS-120B (openai2025gptoss), DeepSeek-R1-FP4-685B (Guo2025deepseek), and one of our STG-curated distilled models.

We use Verilator (Snyder2024Verilator), an open-source Verilog simulator, to perform RTL simulations; line and toggle coverage metrics are collected through Verilator’s built-in coverage instrumentation. All three experiment tracks use VerilogEval (Liu2023VerilogEval; Thakur2024RevisitingVerilogEval) (156 problems). The model-distillation experiments (§[5.3](https://arxiv.org/html/2606.12983#S5.SS3)) additionally evaluate on RTLLM-v2 (lu2024rtllm) (50 problems) and CVDP (pinckney2025cvdp) categories cid02 and cid03 (172 problems), which cover non-agentic code completion and generation tasks suited to our target: RTL generation. CVDP is a newer and harder benchmark that is not used by prior works (QiMeng2025CodeVR1; teng2025verirl).

Note on VerilogEval testbench correctness. During our evaluation, we identified cases where both manual inspection and STG agreed that a generated DUT was functionally correct, yet VerilogEval’s original testbench reported a failure.
The root cause is a race condition: as shown in Fig. [7](https://arxiv.org/html/2606.12983#S4.F7), both the stimulus and checker blocks trigger on the same clock edge with no ordering guarantee, so the checker may compare a newly driven input against a stale reference-model output. The fix is to insert a single #1 delay in the stimulus block after the clock edge, ensuring all reference evaluations complete before new inputs are driven. All VerilogEval results reported in this paper use our manually corrected testbenches.

Table 3. Testbench generation comparison on VerilogEval.

Table 4. DUT classification accuracy.

Table 5. Pass@k (%) before and after training, grouped by base model. We report pass@k with $n = 20$ samples per problem. Benchmarks: RTLLM-v2 (lu2024rtllm), VerilogEval-v2 (Liu2023VerilogEval; Thakur2024RevisitingVerilogEval), and CVDP (pinckney2025cvdp).

| Role | Model | SFT | RL | RTLLM-v2 p@1 | RTLLM-v2 p@5 | RTLLM-v2 p@10 | VerilogEval-v2 p@1 | VerilogEval-v2 p@5 | VerilogEval-v2 p@10 | CVDP p@1 | CVDP p@5 | CVDP p@10 | Z-score (%) p@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Teacher | GPT-OSS-120B (openai2025gptoss) | – | – | 69.9 | 78.1 | 80.8 | 89.6 | 96.7 | 97.6 | 42.9 | 57.9 | 62.2 | 94 |
| Base | Qwen2.5-C-7B-Instruct (hui2024qwen25codertechnicalreport) | – | – | 29.3 | 48.6 | 56.0 | 33.6 | 53.7 | 60.1 | 13.6 | 25.1 | 29.8 | -151 |
| Base | Qwen3-4B-Thinking (yang2025qwen3technicalreport) | – | – | 36.4 | 50.9 | 56.3 | 21.4 | 30.4 | 33.4 | 15.4 | 24.8 | 29.1 | -200 |
| Base | Qwen3-8B (yang2025qwen3technicalreport) | – | – | 40.2 | 61.1 | 67.6 | 52.5 | 65.4 | 69.1 | 17.4 | 28.7 | 34.4 | -80 |
| Fine-tuned (base: Qwen2.5-C-7B-Instruct) | CodeV-R1 (QiMeng2025CodeVR1) | ✓ | ✓ | 68.0 | 77.6 | 80.7 | 73.2 | 83.6 | 86.6 | 34.5 | 50.4 | 54.8 | 54 |
| Fine-tuned (base: Qwen2.5-C-7B-Instruct) | VeriRL (paper) (teng2025verirl) | ✓ | ✓ | 63.3 | 70.3 | – | 67.2 | 76.1 | – | – | – | – | – |
| Fine-tuned (base: Qwen2.5-C-7B-Instruct) | VeriRL (reproduced) | ✓ | ✓ | 71.8 | 77.6 | 78.8 | 60.8 | 74.8 | 78.8 | 18.1 | 27.5 | 31.9 | -29 |
| Fine-tuned (base: Qwen2.5-C-7B-Instruct) | SiliconMind-V1 (Chen2026SiliconMindV1) | ✓ | × | 63.8 | 74.0 | 75.9 | 73.9 | 83.6 | 85.8 | 31.3 | 47.5 | 52.9 | 30 |
| Fine-tuned (base: Qwen2.5-C-7B-Instruct) | STG (Ours) | ✓ | × | 63.1 | 76.4 | 79.0 | 70.5 | 84.9 | 89.4 | 32.4 | 50.5 | 56.0 | 56 |
| Fine-tuned (base: Qwen3-4B-Thinking) | SiliconMind-V1 (Chen2026SiliconMindV1) | ✓ | × | 67.9 | 75.3 | 76.0 | 82.0 | 89.6 | 91.0 | 33.4 | 47.3 | 51.9 | 37 |
| Fine-tuned (base: Qwen3-4B-Thinking) | STG (Ours) | ✓ | × | 67.5 | 78.2 | 79.8 | 80.0 | 90.2 | 91.5 | 35.6 | 52.4 | 57.9 | 68 |
| Fine-tuned (base: Qwen3-8B) | SiliconMind-V1 (Chen2026SiliconMindV1) | ✓ | × | 66.6 | 74.9 | 76.5 | 81.0 | 89.8 | 92.4 | 34.4 | 49.2 | 53.8 | 46 |
| Fine-tuned (base: Qwen3-8B) | STG (Ours) | ✓ | × | 68.7 | 79.9 | 81.9 | 80.2 | 89.9 | 92.0 | 36.5 | 52.9 | 58.1 | 77 |

The final column reports the mean Z-score of pass@10 across the three benchmarks as a single aggregate metric: for each benchmark we compute $z=(x-\mu)/\sigma$, then average the three resulting Z-scores. Colors denote rankings among all fine-tuned models: first, second, and third. Bold marks the best within each base-model group.

### 5.2. Testbench Quality and DUT Classification

We first evaluate STG as a direct replacement for human-crafted testbenches on VerilogEval. For each of the 156 problems, we use GPT-OSS-120B to generate approximately 10 correct and 10 incorrect variants from the golden reference, yielding 3,046 DUTs in total. Each DUT is verified by two methods: (1) *Pure-LLM*, a ConfiBench-style (Liu2025ConfiBench) prompt-based testbench with up to 5 iterative refinement rounds, and (2) *STG*, a single-pass STG-generated testbench.

Table 3 summarizes the generation cost and coverage metrics. STG generates testbenches $\mathbf{720\times}$ faster than the iterative LLM approach while achieving higher line and toggle coverage (+1.9 and +10.4 percentage points, respectively): STG exhaustively enumerates all combinations of control-flow signals, guaranteeing that every control path is exercised at least once for combinational designs, whereas the stochastic LLM testbench may leave rare control states untested.

Table [4](https://arxiv.org/html/2606.12983#S5.T4) breaks down the classification outcomes into four categories. STG and the LLM-based testbench agree on 91.4% of cases. In the 7.8% of cases where only STG succeeds, the dominant failure mode is the LLM testbench producing a false PASS on an incorrect DUT (193 out of 236 cases), confirming that stochastic testbenches are unreliable at detecting subtle bugs.
The remaining failures, 0.9% where only STG fails and 1.6% where both fail, share the same reason: bugs that require exhaustive state-space enumeration to expose, beyond the reach of either structured or stochastic stimulus.

Figure 8. State visit counts under STG-Sequential (random stimulus) and STG-FSM (guided traversal) for a 15-state Mealy sequence detector. Random stimulus visits decay exponentially and fail to reach states S11–S14.

#### 5.2.1. Coverage on FSM-Dominated Designs

To illustrate when the FSM-guided strategy (§[3.2.3](https://arxiv.org/html/2606.12983#S3.SS2.SSS3)) is most valuable, we compare the two STG modes on a 15-bit sliding-window Mealy sequence detector with 15 states (S0–S14) and 30 transitions. This design requires a specific 15-bit input sequence to trigger the detection output, a scenario where random stimulus is exponentially unlikely to succeed.

Fig. [8](https://arxiv.org/html/2606.12983#S5.F8) shows the per-state visit counts. Under STG-Sequential with random stimulus, visits decay exponentially and states S11–S14 are never entered. In contrast, STG-FSM performs DFS passes over the extracted transition graph, achieving 100% transition coverage. This result shows that FSM-guided traversal is essential for designs with deep state spaces that random stimulus cannot penetrate.

In the main experiments in Table [4](https://arxiv.org/html/2606.12983#S5.T4), all 156 VerilogEval problems are verified using the general sequential strategy, which already achieves high coverage on the benchmark’s predominantly shallow-state designs. The FSM-guided mode serves as a complementary strategy for a subset of designs where targeted state exploration is required.

Table 6. Resource comparison for testbench generation on 115k problems: pure-LLM (GB200) vs. STG (a CPU core).

Table 7. Pass rate (%) at 256 search nodes.

Figure 9. Percentage of correctly solved problems vs.
search node budget for four backbone models.

Figure 10. Node-count distribution for non-trivial and solved problems for each model. Outliers beyond $1.5\times$ the interquartile range (IQR) are suppressed for readability.

### 5.3. Verification-Oriented Data Curation

We evaluate STG as the verification engine for large-scale data curation in a model-distillation pipeline, as illustrated in Fig. [5](https://arxiv.org/html/2606.12983#S4.F5).

Table [6](https://arxiv.org/html/2606.12983#S5.T6) compares the resource footprint of testbench generation on 115k problems. The pure-LLM baseline uses single-pass generation on a GB200 GPU without iterative refinement, and only 71.3% of its outputs are compilable testbenches, while STG guarantees compilable output by construction. STG on a single CPU core completes the task in 5.6 hours compared to 59.1 hours for the LLM baseline ($10.6\times$ speedup). Because STG runs on a CPU core ($\approx$100 W) rather than a 1,200 W GPU, it reduces total energy by $127\times$ (from 70.9 to 0.56 kWh), on hardware that costs $15\times$ less. Moreover, STG’s pipeline is trivially parallelizable via CPU multiprocessing for further speedup with minimal engineering effort.

Model training results. Table [5](https://arxiv.org/html/2606.12983#S5.T5) presents our fine-tuning results, grouped by base model to facilitate direct comparison. We report $\text{pass@}k = \mathbb{E}\left[1 - \binom{n-c}{k} / \binom{n}{k}\right]$, the unbiased estimator of the probability that at least one of $k$ samples passes, where $n$ is the total number of generated samples and $c$ is the number of successful ones. As a single aggregate metric across the three benchmarks, our STG-trained models achieve the top three mean Z-scores for pass@10.
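
The two metrics used here can be computed in a few lines. The sketch below implements the pass@k estimator exactly as defined above and a mean Z-score aggregate; the function names are hypothetical, and the Z-score part assumes that $\mu$ and $\sigma$ are taken over whatever set of models is passed in, using the population standard deviation, since the text does not specify either choice.

```python
from math import comb
from statistics import mean, pstdev
from typing import Dict, Iterable


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Iterable[int], n: int, k: int) -> float:
    """Average the per-problem estimator over a benchmark (the expectation)."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


def mean_z_scores(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """scores[benchmark][model] -> mean Z-score per model across benchmarks.

    Assumption: mu and sigma are computed per benchmark over the models given,
    with the population standard deviation.
    """
    per_model: Dict[str, list] = {}
    for by_model in scores.values():
        mu = mean(by_model.values())
        sigma = pstdev(by_model.values())
        for model, x in by_model.items():
            per_model.setdefault(model, []).append((x - mu) / sigma)
    return {model: mean(zs) for model, zs in per_model.items()}


# Toy inputs (illustrative values, not results from the paper).
quarter = pass_at_k(n=20, c=5, k=1)
toy_benchmark = benchmark_pass_at_k([0, 20], n=20, k=1)
toy_z = mean_z_scores({
    "bench_a": {"m1": 1.0, "m2": 3.0},
    "bench_b": {"m1": 10.0, "m2": 30.0},
})
print(quarter, toy_benchmark, toy_z)
```

Standardizing per benchmark before averaging keeps a benchmark with a wide score spread from dominating the aggregate.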
Despite relying on only a single SFT stage after STG-based data curation, our models remain competitive with or outperform more complex multi-stage SFT (Chen2026SiliconMindV1) and SFT+RL (QiMeng2025CodeVR1; teng2025verirl) pipelines. On Qwen2.5-Coder-7B-Instruct, our model surpasses previous work on VerilogEval and CVDP at pass@5 and pass@10. On the Qwen3 series, our models achieve the strongest CVDP results and the best pass@5 on VerilogEval within each base-model group; at pass@10 on VerilogEval, our model leads on Qwen3-4B-Thinking (91.5 vs. 91.0) and trails SiliconMind-V1 slightly on Qwen3-8B (92.0 vs. 92.4). While RL-based methods perform well on RTLLM (a 2024 benchmark), their complexity is not justified by consistent gains on the newer 2025 benchmarks, VerilogEval-v2 and CVDP.

We also encountered substantial reproducibility issues with VeriRL. Relative to the numbers presented in the paper (teng2025verirl), our replicated VeriRL checkpoint scores significantly higher on RTLLM-v2 but worse on VerilogEval-v2 even after applying our VerilogEval testbench fix, whereas the other evaluated models consistently improve under the corrected benchmark. Combined with VeriRL’s weak transfer to VerilogEval-v2 and CVDP, this discrepancy suggests that the released model may overfit artifacts specific to RTLLM rather than delivering robust gains across newer benchmarks.

Overall, the results demonstrate that a simple data curation pipeline powered by STG can yield strong and competitive distilled models with only one simple SFT stage, without the need for complex multi-stage SFT and RL-centric training workflows.

### 5.4. Test-Time Scaling

We integrate STG into an MCTS-based test-time scaling refinement loop based on VFlow (wei2026vflow), as illustrated in Fig. [6](https://arxiv.org/html/2606.12983#S4.F6), and compare it against using the benchmark-provided testbench as the verification oracle.
Experiments are conducted on our modified VerilogEval with four backbone LLMs: three prior models (SiliconMind-V1-7B, GPT-OSS-120B, DeepSeek-R1-685B) and our STG-curated distilled model from Section [5.3](https://arxiv.org/html/2606.12983#S5.SS3) (STG-Qwen3-4B-Thinking). For each problem, the search expands nodes until the candidate DUT passes the testbench or a budget of 256 nodes is exhausted.

Table [7](https://arxiv.org/html/2606.12983#S5.T7) reports the pass rate at the full 256-node budget, and Fig. [9](https://arxiv.org/html/2606.12983#S5.F9) shows the number of search nodes required to reach each pass-rate percentile in the 70–100% range. Across all four backbone LLMs, STG matches or improves the pass rate (Table [7](https://arxiv.org/html/2606.12983#S5.T7)) and reduces the node count at most percentiles (Fig. [9](https://arxiv.org/html/2606.12983#S5.F9)). Fig. [10](https://arxiv.org/html/2606.12983#S5.F10) further details the node-count distribution for non-trivial solved problems (i.e., those requiring more than one search node), showing that STG lowers the mean node count by 14–47% and reduces both the interquartile range and the median node count. Because STG tests more patterns and reports per-output-port pass rates, the verification signal is more informative and guides the LLM to search more efficiently. Overall, STG’s contribution to test-time scaling is twofold: it matches or increases the final pass rate and reduces per-problem search cost.

## 6. Conclusion

This paper presents STG, a structured testbench generation framework that treats module-level RTL verification as a structured generation problem rather than unconstrained code synthesis. Powered by design-type-specific, template-based rendering, STG produces testbenches deterministically at $720\times$ the speed of iterative LLM approaches with higher coverage.
Across three application scenarios, STG consistently outperforms LLM-based alternatives at a fraction of the cost: it correctly classifies an additional 7.8% of DUTs that the LLM-based testbench misjudges (mostly false passes on incorrect designs), reduces mean MCTS search node count by 14–47% across four backbone models, and enables large-scale data curation $11\times$ faster on a single CPU core than LLM-based filtering while supporting strong distilled models with only an SFT stage. These results establish STG as a practical, low-cost verification backbone for LLM-driven HDL workflows. They also suggest that the effectiveness of recent complex RL training workflows remains questionable, especially on newer benchmarks where our simpler pipeline provides competitive performance.

Future work includes integration with reliable FSM extraction for complex production RTL. Additionally, as LLM-driven hardware design moves toward continuous learning, where models are iteratively retrained on newly generated data, efficient and reliable data curation becomes increasingly critical; STG’s low-cost verification pipeline is well positioned to support such end-to-end workflows. Finally, the strong HDL-specialized small language models produced by STG-curated distillation are natural candidates for speculative decoding, where a lightweight draft model accelerates inference of a larger backbone while preserving exact output quality.

###### Acknowledgements.

We acknowledge the financial support from Academia Sinica’s SiliconMind Project (AS-IAIA-114-M11). This work was also supported in part by the National Science and Technology Council, Taiwan (112-2221-E-002-159-MY3), as well as the National Center for High-performance Computing and Taipei-1 for computational resources.

## References