Alpha-RTL: Test-Time Training for RTL Hardware Optimization

arXiv cs.LG 06/05/26, 04:00 AM Papers
Summary
Alpha-RTL (TTT-RTL) introduces a test-time training framework for RTL hardware optimization, using reinforcement learning with EDA feedback to refine LLM-generated designs. It reduces the geometric-mean PPA product by 65.1% on RTLLM v2.0 and achieves a 59.4% ADP reduction on a XuanTie C910 leading-zero-anticipation unit.
arXiv:2606.05253v1 Announce Type: new Abstract: Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback for the specific RTL problem at hand. We propose TTT-RTL, to our knowledge the first per-design test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued rewards, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, and reward saturation signals. On RTLLM v2.0 under Nangate 45nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, outperforming the strongest published frozen-policy agent baseline at 26.1%. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction, and ablations confirm that policy adaptation, state reuse, and KL-budget control each contribute. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware.
# Alpha-RTL: Test-Time Training for RTL Hardware Optimization
Source: [https://arxiv.org/html/2606.05253](https://arxiv.org/html/2606.05253)
Peilong Zhou$^{1,2,3}$, Zhirong Chen$^{1,2}$, Cangyuan Li$^{1,2}$, Haoyu Gao$^{1,2,3}$, Kaiyan Chang$^{1,2}$, Ziming Qu$^{1,2}$, Ying Wang$^{1,2}$. $^{1}$SKLP, Institute of Computing Technology, Chinese Academy of Sciences; $^{2}$University of the Chinese Academy of Sciences; $^{3}$School of Advanced Interdisciplinary Sciences. Correspondence: [wangying2009@ict.ac.cn](mailto:wangying2009@ict.ac.cn)

###### Abstract

Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent LLM-for-RTL systems further improve this ability through Verilog-domain distillation, RLVR with automated testbenches, or EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards. However, these methods primarily train a general RTL generator before deployment, while test-time approaches typically search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback from the specific RTL optimization problem. This setting is challenging because candidate designs must pass sparse discrete validity gates (syntax checking and simulation) before receiving meaningful synthesis-derived feedback on area, timing, or power. We propose TTT-RTL, to our knowledge the first *per-design* test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued reward groups, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, constant-reward fraction, and beta-search saturation. On RTLLM v2.0 under Nangate 45 nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, compared with 26.1% for the strongest published frozen-policy agent baseline under the same reference-normalized metric. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction over the original implementation, and ablations show that policy adaptation, state reuse, and KL-budget control each contribute to the gain. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware.

## 1 Introduction

Hardware design is a laborious process whose quality is ultimately judged by physical metrics: area, clock frequency, and power consumption. Modern EDA flows translate a register-transfer-level (RTL) description written in Verilog or VHDL into a physical circuit; the resulting *PPA product* (Area $\times$ Delay $\times$ Power, the joint figure of merit used by RTLLM v2.0 reports) depends not only on the logical correctness of the code but also on fine-grained micro-architectural choices (pipeline depth, operator scheduling, signal encoding) that are difficult to predict without running synthesis.

Recent work has shown that LLMs can generate functionally correct RTL code (Liu et al., [2023b](https://arxiv.org/html/2606.05253#bib.bib2), [c](https://arxiv.org/html/2606.05253#bib.bib3); Blocklove et al., [2023](https://arxiv.org/html/2606.05253#bib.bib4)). These models learn to imitate high-quality Verilog from training corpora, and when prompted with a hardware specification they can produce designs that pass functional simulation. However, functional correctness is a necessary but insufficient condition for hardware quality. A multiplier that passes all testbench cases may still be 30% larger or 30% slower than an optimized reference implementation. Because physical synthesis metrics are computed by EDA tools at evaluation time, not during LLM training, such imitation-trained models receive no signal from which to learn to minimize the PPA product.

Existing approaches to optimize RTL fall into two families, each capturing only half of what is needed. Frozen-LLM search agents, namely EvolVE (Hsin et al., [2026](https://arxiv.org/html/2606.05253#bib.bib20)), VeriAgent (Wang et al., [2026](https://arxiv.org/html/2606.05253#bib.bib21)), and the current SOTA REvolution (Min et al., [2025](https://arxiv.org/html/2606.05253#bib.bib22)), drive an LLM through evolutionary or reflective loops with synthesis feedback, but the weights never update: search quality is bounded by the unchanged base policy, and the design knowledge accumulated from EDA tools is discarded the moment the run ends. Training-time RL methods, namely ChipSeek (Chen et al., [2025](https://arxiv.org/html/2606.05253#bib.bib19)) and EARL (Shi et al., [2025](https://arxiv.org/html/2606.05253#bib.bib35)), do learn from synthesis signals, but they ship a single amortized policy evaluated under Pass@$k$: each candidate is an independent single-shot generation, with no mechanism to feed EDA results from one attempt back into the next on the same design.

We argue the missing regime is to *fuse search into training at test time*: given one design and its reference, run a multi-round search in which the EDA feedback from every rollout updates the model *online for that problem*, so the policy itself improves as the search progresses. Concretely, we treat Verilog candidates as nodes in a PUCT tree whose every expansion is scored by a three-stage EDA pipeline (syntax $\to$ simulation $\to$ synthesis), and we feed that physical reward back into online policy-gradient updates on the same design. The result is a closed loop in which exploration and training co-evolve under EDA-grounded feedback, rather than search being grafted on top of a frozen policy. On RTLLM v2.0 this regime substantially outperforms the strongest published agent baseline on PPA-product reduction, and on a XuanTie C910 industrial floating-point unit it improves over well-tuned production RTL; we quantify both below.

#### Contributions\.

1. We formulate per-design RTL optimization as a test-time training problem, fusing PUCT-guided search into online policy updates driven by EDA feedback.
2. We propose TTT-RTL, instantiating this regime with a Verilog state pool, EDA-feedback prompting, an entropic advantage estimator, and a three-stage syntax/simulation/synthesis reward, built on verl (Sheng et al., [2024](https://arxiv.org/html/2606.05253#bib.bib14)).
3. On RTLLM v2.0, TTT-RTL covers 48/49 designs and cuts the geometric-mean PPA product by 65.1%, vs. 26.1% for the strongest published agent baseline (REvolution); on a XuanTie C910 industrial LZA unit it improves over well-tuned production RTL by 59.4% ADP at seed 42, with single-seed component ablations isolating the contribution of the PUCT pool and the entropic estimator.

## 2 Related Work

### 2.1 Test-time training and RL with verifiable rewards

Test-time training (TTT) (Sun et al., [2020](https://arxiv.org/html/2606.05253#bib.bib18)) updates model parameters at inference time using an auxiliary self-supervised objective constructed from the test input. Yuksekgonul et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib1)) extended TTT to discrete reasoning via PUCT-guided exploration: a verifier scores candidate solutions and policy-gradient updates improve the generation policy for the specific problem at hand, enabling LLMs to solve mathematical problems beyond their static training distribution. TTRL (Zuo et al., [2025](https://arxiv.org/html/2606.05253#bib.bib34)) similarly performs test-time RL using majority-vote pseudo-labels on unlabeled data, while Snell et al. ([2024](https://arxiv.org/html/2606.05253#bib.bib10)) and Wu et al. ([2024](https://arxiv.org/html/2606.05253#bib.bib11)) scale test-time compute without updating the model at all. On the optimizer side, GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05253#bib.bib9)) provides the group-relative policy-gradient recipe we build on, originally developed for verifiable mathematical correctness, and CodeRL (Le et al., [2022](https://arxiv.org/html/2606.05253#bib.bib33)) is the closest precedent in program synthesis, using unit-test pass/fail as the RL reward signal.

### 2.2 LLMs for RTL and PPA optimization

#### Verilog generation targeting functional correctness\.

A growing body of work applies LLMs to hardware description language generation and evaluates them on open benchmarks. VerilogEval (Liu et al., [2023b](https://arxiv.org/html/2606.05253#bib.bib2)) measures pass rates on 156 HDLBits tasks, while RTLLM (Lu et al., [2024](https://arxiv.org/html/2606.05253#bib.bib28)) and its v2.0 successor OpenLLM-RTL (Liu et al., [2025](https://arxiv.org/html/2606.05253#bib.bib15)) extend the setting to larger designs with synthesis feedback; RTLLM v2.0 is the benchmark we use throughout this paper. On the generation side, RTLCoder (Liu et al., [2023c](https://arxiv.org/html/2606.05253#bib.bib3)), Chip-Chat (Blocklove et al., [2023](https://arxiv.org/html/2606.05253#bib.bib4)), ChipNeMo (Liu et al., [2023a](https://arxiv.org/html/2606.05253#bib.bib29)), and BetterV (Pei et al., [2024](https://arxiv.org/html/2606.05253#bib.bib30)) fine-tune or steer LLMs for Verilog but target functional correctness only and do not optimize physical metrics.

#### Frozen\-policy agents for PPA\.

A more recent line targets physical optimization directly, using LLMs as frozen components inside a search or multi-agent loop: EvolVE (Hsin et al., [2026](https://arxiv.org/html/2606.05253#bib.bib20)) (evolutionary code generation), VeriAgent (Wang et al., [2026](https://arxiv.org/html/2606.05253#bib.bib21)) (multi-agent system with synthesis-feedback critic), REvolution (Min et al., [2025](https://arxiv.org/html/2606.05253#bib.bib22)) (population-based search with reflection, current SOTA), and COEVO (Ping et al., [2026](https://arxiv.org/html/2606.05253#bib.bib23)) (co-evolution of correctness and PPA). EvolVE, VeriAgent, and REvolution are the headline baselines for our RTLLM v2.0 comparison, with their per-design ratios taken from the COEVO paper; in all cases LLM weights are never updated.

#### Training\-time RL on RTL\.

ChipSeek (Chen et al., [2025](https://arxiv.org/html/2606.05253#bib.bib19)) is the closest RL-based prior work: it trains RTL-generation models with hierarchical EDA feedback and reports 84.09% Pass@5 on RTLLM v2.0, with average normalized EDAP reduced to 0.76 in its best configuration. EARL (Shi et al., [2025](https://arxiv.org/html/2606.05253#bib.bib35)) is a concurrent training-time RL method that combines SFT with entropy-aware DAPO-style updates on verifiable compiler/testbench signals. Both produce an offline, amortized policy shared across all designs and are evaluated under a Pass@$k$ generation protocol.

#### Differentiation\.

Unlike functional-only Verilog generators, frozen-LLM PPA agents whose weights never update, and offline RL methods that ship a single amortized policy, TTT-RTL combines *EDA-grounded reward* with *per-design test-time adaptation*: each problem triggers its own on-the-fly policy update driven by a PUCT state pool, with a verifier gated by mixed discrete–continuous physical metrics rather than scalar pass/fail, and we report coverage and reference-normalized PPA-product over the full RTLLM v2.0 benchmark rather than Pass@$k$.

## 3 Method: TTT-RTL

![Refer to caption](https://arxiv.org/html/2606.05253v1/x1.png)

Figure 1: Overview of TTT-RTL. The framework comprises four parts: Problem Assets ([Section 3.1](https://arxiv.org/html/2606.05253#S3.SS1)), the PUCT State Pool with EDA-feedback prompt builder ([Section 3.2](https://arxiv.org/html/2606.05253#S3.SS2)), the LLM rollout and three-stage EDA reward pipeline ([Section 3.3](https://arxiv.org/html/2606.05253#S3.SS3)), and the test-time policy update with state-pool admission ([Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4)).

[Figure 1](https://arxiv.org/html/2606.05253#S3.F1) summarizes the four parts of the closed-loop TTT-RTL pipeline; the rest of this section formalizes each component.

### 3.1 Problem Formulation and PUCT-Guided State Sampling

Let $\mathcal{P}$ denote an RTL design problem specified by a natural-language description, a functional testbench, and a reference implementation. A *design state* $s=(v,r,t)$ consists of a Verilog implementation $v$, its reward $r$, and creation timestep $t$. The goal is to find a design $v^{*}$ that is functionally correct and minimizes a target metric $M(v)$ defined in [Section 3.3](https://arxiv.org/html/2606.05253#S3.SS3). We initialize a *state pool* $\mathcal{S}$ with a single root state $s_{0}=(\epsilon,\,0,\,-1)$ ($\epsilon$ is an empty implementation), and grow $\mathcal{S}$ iteratively by sampling parents under a PUCT score that balances exploitation (high-reward nodes) and exploration (rarely-sampled nodes):

$$\mathrm{PUCT}(s)=Q(s)+c\cdot\sigma\cdot P(s)\cdot\frac{\sqrt{1+T}}{1+N(s)},\tag{1}$$

where $c=1.0$ is the exploration coefficient, $T$ is the total number of expanded parents so far, and $N(s)$ counts how often $s$ (or any of its descendants) has been expanded. $Q(s)$ is the best one-step reachable child reward of $s$ for visited states, falling back to $R(s)$, the reward of $s$ itself, when $s$ has not yet been expanded. The scale factor $\sigma=R_{\max}-R_{\min}$ is the reward range taken globally over the current pool $\mathcal{S}$, and $P(s)$ is the normalized linear-rank prior

$$P(s)=\frac{|\mathcal{S}|-\operatorname{rank}(s)}{\sum_{s'\in\mathcal{S}}\bigl(|\mathcal{S}|-\operatorname{rank}(s')\bigr)},\qquad\operatorname{rank}(s)\in\{0,\ldots,|\mathcal{S}|-1\},\tag{2}$$

with states sorted in descending reward order so that $\operatorname{rank}(s)=0$ is the best state (Yuksekgonul et al., [2026](https://arxiv.org/html/2606.05253#bib.bib1), App. A.2). To maintain batch diversity, TTT-RTL applies *lineage blocking*: no two states from the same direct parent-child chain appear in one batch.
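The scoring rule in Equations 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `State` container, its field names, and the helper names are hypothetical, and lineage blocking and batch selection are left out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class State:
    """Hypothetical container for one pool entry (names are illustrative)."""
    reward: float                       # R(s): reward of this design
    visits: int = 0                     # N(s): expansions of s or its descendants
    best_child: Optional[float] = None  # best one-step child reward, None if unexpanded


def rank_prior(rewards: List[float]) -> List[float]:
    """Linear-rank prior of Eq. (2); rank 0 is the highest-reward state."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: -rewards[i])
    rank = [0] * n
    for position, index in enumerate(order):
        rank[index] = position
    total = sum(n - r for r in rank)  # equals n * (n + 1) / 2
    return [(n - r) / total for r in rank]


def puct_scores(pool: List[State], total_expansions: int, c: float = 1.0) -> List[float]:
    """PUCT score of Eq. (1) for every state in the pool."""
    rewards = [s.reward for s in pool]
    sigma = max(rewards) - min(rewards)  # global reward range over the pool
    prior = rank_prior(rewards)
    scores = []
    for s, p in zip(pool, prior):
        # Q(s): best child reward if expanded, otherwise the state's own reward.
        q = s.best_child if s.best_child is not None else s.reward
        bonus = c * sigma * p * math.sqrt(1 + total_expansions) / (1 + s.visits)
        scores.append(q + bonus)
    return scores


# Toy pool with made-up rewards: one expanded state and two unexpanded ones.
pool = [State(11.1, visits=3, best_child=12.0), State(5.0), State(0.1)]
scores = puct_scores(pool, total_expansions=3)
```

In this toy pool the exploration bonus narrows the gap between the heavily visited best state and the unvisited second-ranked state, which is the balancing behaviour the text describes.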

### 3.2 EDA-Feedback Prompt Construction

The prompt for each parent state $s$ contains the reference implementation with its synthesis metrics, the design specification, and, for non-root states, the parent's Verilog and its EDA feedback (root prompts elide the parent block):

> \[Reference\] area=\{A\_ref\}, delay=\{D\_ref\}, power=\{P\_ref\}, $M$=\{M\_ref\}\. \[Spec\] \{design\_description\} \[Previous\] \{verilog\_code\} \[Feedback\] Syntax: PASS\. Functional: PASS\. Synthesis: area=\{A\}, delay=\{D\}, power=\{P\}, $M$=\{M\} \(improved by \{$\Delta$\}\)\. Produce a Verilog module with $M$ lower than \{M\}\.

Stating the target as a concrete numerical threshold grounds the generation objective in measurable physical quantities\.
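As a rough illustration of how such a prompt might be assembled, the sketch below fills the template from plain dictionaries. The function name, the dictionary keys, and the choice to target the reference metric in root prompts are assumptions made for this example; the sketch also only covers the fully passing feedback case shown in the template.

```python
from typing import Optional


def build_prompt(spec: str, ref: dict, parent: Optional[dict] = None) -> str:
    """Assemble the prompt for one parent state (root when parent is None)."""
    parts = [
        f"[Reference] area={ref['area']}, delay={ref['delay']}, "
        f"power={ref['power']}, M={ref['M']}.",
        f"[Spec] {spec}",
    ]
    # Assumption: a root prompt asks for a metric below the reference value.
    target = ref["M"]
    if parent is not None:
        # Non-root states add the parent's Verilog and its EDA feedback.
        parts.append(f"[Previous] {parent['verilog']}")
        parts.append(
            "[Feedback] Syntax: PASS. Functional: PASS. "
            f"Synthesis: area={parent['area']}, delay={parent['delay']}, "
            f"power={parent['power']}, M={parent['M']} "
            f"(improved by {parent['delta']})."
        )
        target = parent["M"]
    parts.append(f"Produce a Verilog module with M lower than {target}.")
    return "\n".join(parts)


# Made-up numbers, purely for illustration.
ref = {"area": 10.0, "delay": 5.0, "power": 2.0, "M": 100.0}
parent = {"verilog": "module adder(); endmodule", "area": 8.0, "delay": 5.0,
          "power": 2.0, "M": 80.0, "delta": "20.0%"}
root_prompt = build_prompt("8-bit adder", ref)
child_prompt = build_prompt("8-bit adder", ref, parent)
```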

### 3.3 Three-Stage Reward Function

Each implementation $v$ is evaluated through three stages, with the reward

$$r=\omega_{\text{syn}}r_{\text{syn}}+\omega_{\text{func}}r_{\text{func}}+\omega_{\text{ppa}}r_{\text{ppa}},\quad(\omega_{\text{syn}},\omega_{\text{func}},\omega_{\text{ppa}})=(0.1,1.0,10.0),\tag{3}$$

and evaluation terminating early on failure. Stage 1 (syntax): iverilog compilation; on failure $r_{\text{syn}}\in(0,1]$ decays with the number of error messages (detailed scoring in [Appendix I](https://arxiv.org/html/2606.05253#A9)). Stage 2 (functional): simulation against the reference testbench; $r_{\text{func}}\in\{0,1\}$. Stage 3 (physical synthesis): Yosys + OpenSTA report area $A$ ($\mu\mathrm{m}^{2}$), critical-path delay $D$ (ps) and (where collected) power $P$ ($\mu$W); the PPA reward is the reference-normalized ratio

$$r_{\text{ppa}}=\frac{M_{\text{ref}}}{M(v)},\qquad M(v)=\begin{cases}A\cdot D\cdot P&\text{(RTLLM v2.0 PPA product, }\mathrm{PPA\text{-}product}\text{)}\\ A\cdot D&\text{(C910 LZA ablation, }\mathrm{ADP}\text{; no power)}\end{cases}\tag{4}$$

so that $r_{\text{ppa}}=1$ matches the reference and $r_{\text{ppa}}>1$ strictly improves it. The large $\omega_{\text{ppa}}$ reflects that PPA optimization is the primary objective once functional correctness is achieved. After scoring, the top-$k$ deduplicated children of each parent enter $\mathcal{S}$, PUCT statistics are updated, and the pool is pruned to $C_{\max}$ if exceeded (values in [Table 8](https://arxiv.org/html/2606.05253#A6.T8)).
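The early-terminating reward of Equations 3 and 4 might look like the following sketch. It does not call any EDA tool: stage outcomes are passed in as arguments, and the function and parameter names are hypothetical. It also assumes that a design which compiles receives $r_{\text{syn}}=1$, which the text does not state explicitly.

```python
from typing import Optional

W_SYN, W_FUNC, W_PPA = 0.1, 1.0, 10.0  # weights from Eq. (3)


def three_stage_reward(
    syntax_ok: bool,
    syntax_score: float = 1.0,
    functional_ok: bool = False,
    metric: Optional[float] = None,
    metric_ref: Optional[float] = None,
) -> float:
    """Combine stage results into one scalar, stopping at the first failure.

    syntax_score is the partial credit in (0, 1] used only when compilation
    fails; metric and metric_ref are M(v) and M_ref, both assumed positive.
    """
    if not syntax_ok:
        # Stage 1 failed: only the decayed syntax credit is awarded.
        return W_SYN * syntax_score
    reward = W_SYN * 1.0  # assumption: r_syn = 1 when compilation succeeds
    if not functional_ok:
        # Stage 2 failed: r_func = 0 and synthesis is never run.
        return reward
    reward += W_FUNC * 1.0
    if metric is None or metric_ref is None:
        return reward
    # Stage 3: reference-normalized PPA ratio of Eq. (4).
    return reward + W_PPA * (metric_ref / metric)


at_reference = three_stage_reward(True, functional_ok=True, metric=3.4, metric_ref=3.4)
halved_metric = three_stage_reward(True, functional_ok=True, metric=1.7, metric_ref=3.4)
```

Under these assumptions a design that exactly matches the reference scores 0.1 + 1.0 + 10.0, so the PPA term dominates the reward as soon as a candidate is functionally correct.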

### 3.4 Entropic Advantage Estimation

Following TTT-Discover (Yuksekgonul et al., [2026](https://arxiv.org/html/2606.05253#bib.bib1)), we estimate advantages within each *group* (rollouts sharing a common parent). For a group of rewards $\{r_{1},\ldots,r_{k}\}$, if $\max r_{i}-\min r_{i}<10^{-12}$ the group is degenerate and all advantages are zero; otherwise we find $\beta^{*}$ such that the softmax distribution $q_{i}\propto\exp(\beta^{*}r_{i})$ attains a target KL divergence (*KL budget*) $\delta$ from uniform:

$$\mathrm{KL}(q\,\|\,\mathcal{U})=\sum_{i}q_{i}\ln(k\cdot q_{i})=\delta,\tag{5}$$

solved by binary search. Small $\delta$ keeps $q$ near uniform (diffuse advantages); large $\delta$ allows $q$ to peak on a single rollout. The leave-one-out advantage is

$$A_{i}=\frac{\exp(\beta^{*}r_{i})}{\frac{1}{k-1}\sum_{j\neq i}\exp(\beta^{*}r_{j})}-1,\tag{6}$$

which compares each rollout's exponential reward weight against the average weight of the other rollouts in the same group, yielding a group-relative contrast while avoiding self-normalization. For RTLLM v2.0 we use the canonical fixed $\delta=\ln 2$; [Section 3.7](https://arxiv.org/html/2606.05253#S3.SS7) introduces an adaptive variant we study on the C910 LZA case study ([Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3)).
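A minimal pure-Python sketch of Equations 5 and 6 follows. The function names, the search bound `beta_max`, the iteration count, and the small floor on the denominator are choices made for this illustration rather than details from the paper. The binary search relies on the KL divergence from uniform being non-decreasing in $\beta\ge 0$.

```python
import math
from typing import List


def softmax(rewards: List[float], beta: float) -> List[float]:
    """q_i proportional to exp(beta * r_i), shifted by the max for stability."""
    top = max(rewards)
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def kl_from_uniform(q: List[float]) -> float:
    """KL(q || U) = sum_i q_i * ln(k * q_i), as in Eq. (5)."""
    k = len(q)
    return sum(p * math.log(k * p) for p in q if p > 0.0)


def solve_beta(rewards: List[float], delta: float,
               beta_max: float = 1e4, iters: int = 80) -> float:
    """Binary search for beta* with KL(q || U) = delta.

    If delta is unreachable for this group, the search saturates at beta_max.
    """
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_from_uniform(softmax(rewards, mid)) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def entropic_advantages(rewards: List[float], delta: float = math.log(2)) -> List[float]:
    """Leave-one-out advantages of Eq. (6); zeros for a degenerate group."""
    k = len(rewards)
    if max(rewards) - min(rewards) < 1e-12:
        return [0.0] * k
    beta = solve_beta(rewards, delta)
    top = max(rewards)
    # The common factor exp(-beta * top) cancels in the ratio of Eq. (6).
    w = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(w)
    advantages = []
    for wi in w:
        others_mean = (total - wi) / (k - 1)
        # Floor guards against division by zero when the search saturates.
        advantages.append(wi / max(others_mean, 1e-300) - 1.0)
    return advantages


# Made-up group of four rollout rewards sharing one parent.
group = [11.1, 13.0, 0.1, 1.1]
adv = entropic_advantages(group)
```

The highest-reward rollout receives the largest advantage, and every advantage is bounded below by $-1$ because the ratio in Equation 6 is non-negative.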

### 3.5 Policy Gradient Update

For rollout $i$ with prompt $x_{i}$, the policy loss is

$$\mathcal{L}(\boldsymbol{\theta})=-\log\pi_{\boldsymbol{\theta}}(v_{i}\mid x_{i})\cdot A_{i},\tag{7}$$

averaged over non-degenerate rollouts; following TTT-Discover, we omit the PPO importance-ratio clip (Schulman et al., [2017](https://arxiv.org/html/2606.05253#bib.bib7)).

### 3.6 Full Algorithm

Algorithm 1: TTT-RTL

Require: Design problem $\mathcal{P}$ (spec, testbench, reference), LLM $\pi_{\boldsymbol{\theta}}$, budget $S$, batch size $B$, rollouts per prompt $n$

1. Initialize pool $\mathcal{S}\leftarrow\{s_{0}\}$ (root), $T\leftarrow 0$
2. for step $=1$ to $S$ do
3. Sample $B$ parent states via [Equation 1](https://arxiv.org/html/2606.05253#S3.E1) with lineage blocking
4. Construct prompts $\{x_{b}\}_{b=1}^{B}$ ([Section 3.2](https://arxiv.org/html/2606.05253#S3.SS2))
5. Generate $n$ rollouts per prompt using $\pi_{\boldsymbol{\theta}}$ via vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.05253#bib.bib25))
6. Evaluate rewards via 3-stage EDA pipeline ([Section 3.3](https://arxiv.org/html/2606.05253#S3.SS3))
7. Update pool $\mathcal{S}$, PUCT statistics, prune to $C_{\max}$
8. Compute advantages via entropic $\beta^{*}$ at fixed $\delta=\ln 2$ ([Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4))
9. Update $\boldsymbol{\theta}$ via policy gradient [Equation 7](https://arxiv.org/html/2606.05253#S3.E7)
10. end for
11. return Best-reward design in $\mathcal{S}$

### 3.7 Adaptive KL-budget controller

For RTLLM v2.0 we use the canonical fixed $\delta=\ln 2$. On the C910 LZA case study ([Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3)) we additionally study an adaptive variant that replaces the constant with $\delta_{t}$, updated once per step from four EMA-smoothed signals: policy-vs-reference KL, effective number of distinct rollouts per group, fraction of constant-reward groups, and binary-search saturation rate. These are combined through a four-rule priority ladder (KL brake, winner-take-all, stagnation, over-exploring) that resets, shrinks, or grows $\delta_{t}$ within $[0.25\ln 2,\,4\ln 2]$. Full equations ([Equation 8](https://arxiv.org/html/2606.05253#A9.E8)–[Equation 12](https://arxiv.org/html/2606.05253#A9.E12)) and hyperparameters ([Table 9](https://arxiv.org/html/2606.05253#A9.T9)) are in [Appendix I](https://arxiv.org/html/2606.05253#A9).

## 4 Experiments

### 4.1 Experimental Setup

#### Base model and training configuration\.

The policy model is Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.05253#bib.bib24)). RTLLM v2.0 main runs initialize from a lightweight format-and-style SFT warm-up (ttt-rtl-sft step 18; see [Appendix G](https://arxiv.org/html/2606.05253#A7)); the C910 LZA ablations ([Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3)) instead use the *raw* Qwen3-8B base model with no SFT, isolating the contribution of test-time RL on top of an off-the-shelf backbone. Each run is 100 gradient steps with $B=4$ parent states, top-$k$ children with $k=2$, exploration coefficient $c=1.0$, and the reward weights of [Equation 3](https://arxiv.org/html/2606.05253#S3.E3); RTLLM v2.0 uses $n=4$ rollouts/prompt and C910 LZA ablations use $n=8$, matching TTT-Discover's per-step budget (Yuksekgonul et al., [2026](https://arxiv.org/html/2606.05253#bib.bib1)). RTLLM v2.0 uses the canonical fixed $\delta=\ln 2$; C910 ablations vary $\delta$ along one axis of [Table 3](https://arxiv.org/html/2606.05253#S4.T3). Full hyperparameters are in [Table 8](https://arxiv.org/html/2606.05253#A6.T8).

#### Benchmark and baselines\.

RTLLM v2.0 (Lu et al., [2024](https://arxiv.org/html/2606.05253#bib.bib28); Liu et al., [2025](https://arxiv.org/html/2606.05253#bib.bib15)) provides a human-written reference and testbench for each of 49 designs. We report the ratio of the PPA product ($\mathrm{PPA\text{-}product}=A\cdot D\cdot P$) of each method's best functionally correct design to that of the v2.0 reference (lower is better; $1.0$ matches the reference). Baselines are three published agent-based methods that target the same metric: EvolVE (Hsin et al., [2026](https://arxiv.org/html/2606.05253#bib.bib20)), VeriAgent (Wang et al., [2026](https://arxiv.org/html/2606.05253#bib.bib21)), and the current SOTA REvolution (Min et al., [2025](https://arxiv.org/html/2606.05253#bib.bib22)); per-design ratios are taken verbatim from Table 3 of Ping et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib23)), where every baseline runs under GPT-4o-mini against the v2.0 reference under the same PPA-product metric.

#### Synthesis flow\.

We use Yosys (Wolf and Glaser, [2013](https://arxiv.org/html/2606.05253#bib.bib27)) for logic synthesis and OpenSTA for timing/power. RTLLM v2.0 uses Nangate 45 nm typical-corner (matching the published baseline flow); C910 LZA uses Sky130 HD (the C910 release convention). All comparisons are reference-normalized within a single PDK; we discuss residual cross-flow uncertainty in [Appendix B](https://arxiv.org/html/2606.05253#A2).

#### Compute and seed protocol\.

All runs use A800 GPUs and seed 42, in line with the baselines (none of EvolVE / VeriAgent / REvolution report seed variance); a four-seed paired replication on LZA simd\_half and a single-seed case study on a second C910 unit are reported in [Appendix D](https://arxiv.org/html/2606.05253#A4).

### 4.2 RTLLM v2.0: Framework vs Published Agent Baselines

[Table 1](https://arxiv.org/html/2606.05253#S4.T1) aggregates the per-design PPA-product ratios across all 49 RTLLM v2.0 problems; the full per-design table, including the 5 designs where at least one method failed to produce compilable, functionally correct RTL, is in [Table 4](https://arxiv.org/html/2606.05253#A1.T4) ([Appendix A](https://arxiv.org/html/2606.05253#A1)). We characterize the comparison along three complementary views, all of which point in the same direction. (i) Coverage: TTT-RTL produces a valid implementation on 48/49 designs (highest of any method evaluated; missing only asyn\_fifo), and is the only method that succeeds on freq\_divbyfrac and freq\_divbyodd. (ii) Intersection GeoMean ($N=44$): on the subset where all four methods produced a valid output, which is the only subset on which the GeoMean is well-defined for every method and thus directly comparable, TTT-RTL achieves a geometric-mean PPA-product ratio of 0.349 versus 0.739 for the strongest published baseline (REvolution). (iii) Penalized full-benchmark GeoMean ($N=49$, failures $=1.0$): imputing each method's failures as ratio 1.0 ("no improvement over the reference", the most lenient possible imputation since it credits a failed method with matching the reference) and recomputing GeoMean over the full 49 designs gives 0.341 for TTT-RTL versus 0.762 for REvolution ([Table 1](https://arxiv.org/html/2606.05253#S4.T1), last column). The intersection metric ($N=44$) is therefore not a cherry-pick: any defensible way of folding baseline failures into the comparison only widens the TTT-RTL margin, never narrows it.
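The two aggregate views, the intersection GeoMean and the penalized full-benchmark GeoMean, can be illustrated with a short sketch. The data below are made-up toy ratios, not values from the paper, and the function names and data layout are hypothetical.

```python
import math
from typing import Dict, List, Optional

# method -> design -> ratio to the reference (None marks a failed design)
Ratios = Dict[str, Dict[str, Optional[float]]]


def geomean(values: List[float]) -> float:
    """Geometric mean of strictly positive ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def intersection_geomean(ratios: Ratios, designs: List[str]) -> Dict[str, float]:
    """GeoMean over the designs on which every method produced a valid result."""
    common = [d for d in designs
              if all(ratios[m].get(d) is not None for m in ratios)]
    return {m: geomean([ratios[m][d] for d in common]) for m in ratios}


def penalized_geomean(ratios: Ratios, designs: List[str],
                      fail_ratio: float = 1.0) -> Dict[str, float]:
    """GeoMean over all designs, imputing each failure as fail_ratio."""
    result = {}
    for m in ratios:
        filled = []
        for d in designs:
            value = ratios[m].get(d)
            filled.append(fail_ratio if value is None else value)
        result[m] = geomean(filled)
    return result


# Toy example: method_a fails on d3, method_b succeeds everywhere.
designs = ["d1", "d2", "d3"]
toy = {
    "method_a": {"d1": 0.5, "d2": 0.5, "d3": None},
    "method_b": {"d1": 1.0, "d2": 0.25, "d3": 0.5},
}
common_view = intersection_geomean(toy, designs)
penalized_view = penalized_geomean(toy, designs)
```

In the toy data the two methods tie on the common subset, while the penalized view charges `method_a` for its failure on `d3`; this is the mechanism by which imputation changes the ranking only for methods with missing designs.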

Table 1: Reference-normalized external comparison on RTLLM v2.0. Each cell reports the PPA-product ratio ($\mathrm{PPA\text{-}product}=A\cdot D\cdot P$) of a method's best functionally correct design over the v2.0 reference implementation; lower is better. Baseline ratios are taken from Table 3 of Ping et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib23)) (GPT-4o-mini), while TTT-RTL is evaluated under our matched Yosys / OpenSTA / Nangate 45 nm flow. *Coverage* counts designs (out of 49) on which a method produced compilable, functionally correct RTL; *GeoMean / ArithMean* ($N=44$) are over the common subset where all four methods completed; *#Improved* / *#Best* count common-subset designs with ratio $<1$ / lowest ratio (ties split equally); $\mathrm{GeoMean}_{N=49}^{\mathrm{pen}}$ imputes each method's failures as ratio 1.0 (subset choices and robustness check are detailed below). See [Section 4.2](https://arxiv.org/html/2606.05253#S4.SS2) and [Appendix B](https://arxiv.org/html/2606.05253#A2) for the flow sanity check.

![Refer to caption](https://arxiv.org/html/2606.05253v1/x2.png)

Figure 2: Per-design PPA-product ratio ($\mathrm{PPA\text{-}product}=A\cdot D\cdot P$) on RTLLM v2.0 (the $N=44$ common subset, listed in benchmark order; the 5 designs with at least one missing method are shown in [Table 4](https://arxiv.org/html/2606.05253#A1.T4)). The EvolVE / VeriAgent / REvolution per-design ratios are the GPT-4o-mini numbers reported in Table 3 of Ping et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib23)); TTT-RTL ratios are produced under the flow described in [Section 4](https://arxiv.org/html/2606.05253#S4). Top: four-method bar chart; the dashed black line is the v2.0 reference (1.0), and the colored dotted lines are each method's $N=44$ geometric-mean ratio. Bottom: aggregate geometric-mean PPA-product ratio per method on the $N=44$ common subset. TTT-RTL is below the reference on 38/44 designs and is the best method on 33.5/44 (next-best is REvolution at 8/44).

The advantage is consistent across complexity bins ([Table 2](https://arxiv.org/html/2606.05253#S4.T2)), including on small combinational designs where prior methods barely improve. Full per-design ratios and the failure breakdown are in [Table 4](https://arxiv.org/html/2606.05253#A1.T4) and [Appendix A](https://arxiv.org/html/2606.05253#A1).

Table 2: Geometric-mean PPA-product ratio on RTLLM v2.0, broken down by design complexity (binned by reference PPA product). Lower is better; bold marks the best method per row.

Of the five non-intersection designs, four are ones our flow handles where one or more baselines failed, so excluding them is conservative for TTT-RTL rather than a cherry-pick; per-design failures are listed in [Appendix A](https://arxiv.org/html/2606.05253#A1). A small number of v2.0 designs hit the OpenSTA 0 ps delay-reporting floor (PPA product collapses to area-only), which affects all four methods uniformly under the shared reference; we flag this as a reward-shaping limitation in the NeurIPS checklist (Limitations).

#### Caveat: external systems comparison, not same\-backbone isolation\.

The four methods in [Table 1](https://arxiv.org/html/2606.05253#S4.T1) share the same v2.0 reference, the same PPA-product metric, and (for our re-runs) the same Yosys + OpenSTA flow, so the comparison is reference-normalized and ratio-controlled. The methods do, however, use different policy LLMs: EvolVE, VeriAgent, and REvolution use GPT-4o-mini (as reported in Ping et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib23)), Table 3), while TTT-RTL uses Qwen3-8B with domain SFT plus per-design test-time training. We therefore read [Table 1](https://arxiv.org/html/2606.05253#S4.T1) as a *systems* comparison (per-design test-time training on top of an open-source backbone vs. frozen agentic search on top of a stronger commercial backbone) rather than as a pure algorithmic isolation. A frozen Best-of-$N$ baseline on the C910 unit ([Table 3](https://arxiv.org/html/2606.05253#S4.T3), "Best-of-$N$" row), which uses the raw Qwen3-8B base model with no SFT and no test-time updates, never produces a functionally correct design within the 3200-rollout budget, supporting that test-time updates, not the backbone or the SFT warm-up, drive the framework's gains.

### 4.3 XuanTie C910 LZA: Industrial Case Study

The RTLLM v2.0 results above establish that TTT-RTL beats published agent baselines on a textbook benchmark. To stress-test the framework on *production silicon*, we now turn to the leading-zero-anticipation unit `ct_vfmau_lza_simd_half` from the open-source XuanTie C910 RISC-V core (Chen et al., [2020](https://arxiv.org/html/2606.05253#bib.bib26)), well-tuned hand-written RTL where any ADP reduction is non-trivial. We use this single industrial unit to ablate TTT-RTL along three axes: the reuse strategy, the training-time advantage estimator, and the KL-budget schedule. All runs share the same sampling budget ($100$ steps $\times$ $4$ parents $\times$ $n = 8$ rollouts $= 3200$ rollouts), base model (Qwen3-8B), technology library (Sky130), seed ($42$), and reward function ([Equation 3](https://arxiv.org/html/2606.05253#S3.E3)). The reference ADP for this unit is $3.40\,\mathrm{M}\,\mu\mathrm{m}^{2}{\cdot}\mathrm{ps}$, at which the reward is exactly $11.1$; any reward above $11.1$ corresponds to a functionally correct design that strictly improves ADP.

#### Ablation axes\.

We sweep three orthogonal axes one at a time. On the *$\delta$-schedule* axis, the other two axes are fixed at the full TTT-RTL configuration (PUCT reuse, entropic advantage), and the “adaptive $\delta$” row *is* the headline full-TTT-RTL row at the top of [Table 3](https://arxiv.org/html/2606.05253#S4.T3). On the *reuse* and *train* axes we instead hold $\delta$ at the fixed $\ln 2$ TTT-Discover default rather than at adaptive $\delta$, so that the component-isolation comparisons do not implicitly inherit the controller's exploratory gain; the framework-only ($-45.3\%$) row is the natural reference for those rows. The three axes are then (i) reuse $\in$ {PUCT, $\epsilon$-greedy ($\epsilon = 0.1$), none}, where “none” restarts every rollout from the empty root; (ii) train $\in$ {entropic advantage, expected reward}, where the latter is standard GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05253#bib.bib9)) with group-mean-centred, std-normalised advantages and no entropic temperature; (iii) $\delta$-schedule $\in$ {adaptive (controller of [Section 3.7](https://arxiv.org/html/2606.05253#S3.SS7)), fixed $\ln 2$ (TTT-Discover's constant), cosine $1.1 \to 0.3$ (“explore-then-exploit”), cosine $0.3 \to 1.1$ (reverse, sanity-check)}. Two combined-weakest configurations bound the ablation from below: *Naive Test-time RL* (expected reward + no reuse) and *Best-of-$N$* (frozen actor at lr $10^{-12}$, $N = 3200$).
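To make the *train* axis concrete, here is a minimal Python sketch contrasting the two advantage estimators on one group of rollout rewards. The GRPO branch follows the description above (group-mean-centred, std-normalised; the population standard deviation is a choice made here). The entropic branch is an assumption, not the paper's equation: it uses exponential tilting $q_i \propto \exp(\beta r_i)$ with $\beta$ chosen so that $\mathrm{KL}(q \,\|\, \mathrm{uniform}) = \delta$, which is one plausible reading of a KL budget $\delta$. All function names and the reward values are hypothetical.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-mean-centred, std-normalised advantages (expected-reward baseline)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def tilted_probs(rewards, beta):
    """Exponentially tilted distribution q_i ~ exp(beta * r_i), computed stably."""
    top = max(rewards)
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def kl_to_uniform(q):
    """KL(q || uniform) over len(q) outcomes."""
    n = len(q)
    return sum(p * math.log(p * n) for p in q if p > 0.0)

def solve_beta(rewards, delta, beta_hi=1e3, iters=100):
    """Bisection for the beta whose tilted distribution spends a KL budget delta."""
    if max(rewards) == min(rewards):
        return 0.0  # no signal: tilting cannot move away from uniform
    if kl_to_uniform(tilted_probs(rewards, beta_hi)) < delta:
        return beta_hi  # budget unreachable (e.g. ties at the top); saturate
    lo, hi = 0.0, beta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(tilted_probs(rewards, mid)) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def entropic_advantages(rewards, delta):
    """Assumed form: advantage_i = n * q_i - 1, which sums to zero over the group."""
    q = tilted_probs(rewards, solve_beta(rewards, delta))
    n = len(rewards)
    return [n * p - 1.0 for p in q]

# Hypothetical group of n = 8 rollout rewards (0.0 standing in for failed designs).
rewards = [0.0, 0.0, 11.1, 12.0, 0.0, 15.0, 0.0, 20.0]
grpo = grpo_advantages(rewards)
entropic = entropic_advantages(rewards, math.log(2))
```

Under this reading, a larger $\delta$ lets the tilted distribution concentrate more mass on the best rollout, while a smaller $\delta$ keeps the weights closer to uniform.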

Table 3: Ablation results on `ct_vfmau_lza_simd_half` (Sky130, Qwen3-8B, seed $42$, $3200$-rollout budget). All rows are single-seed. Reward $11.1$ corresponds to the reference ADP; entries marked with a dash never produced any functionally correct design within the budget. Per-step reward trajectories are in [Figure 3](https://arxiv.org/html/2606.05253#S4.F3); KL-budget trajectories are in [Appendix C](https://arxiv.org/html/2606.05253#A3). Quantitative gaps within the table should be read as consistent ranking evidence rather than tight effect sizes.

![Refer to caption](https://arxiv.org/html/2606.05253v1/x3.png)

Figure 3: Ablation trajectories on `ct_vfmau_lza_simd_half`, following the three-panel format of Yuksekgonul et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib1)), Fig. 10. (a) Max reward up to step; (b) Mean reward within rollouts in step (typical policy behaviour); (c) Max reward within rollouts in step (per-step exploration quality). The dashed red line marks $r = 11.1$ (reference ADP); designs above the line strictly improve over the C910 baseline. All runs share the same $3200$-rollout budget, base model, reward, and seed; the only differences are the train/reuse components as in [Table 3](https://arxiv.org/html/2606.05253#S4.T3).

#### Reuse and train axes.

With $\delta$ held at the TTT-Discover default $\ln 2$, the framework alone (PUCT + entropic + fixed $\delta$) reaches $-45.3\%$ ADP. Removing the entropic advantage estimator drops this to $-35.7\%$ (plateauing after step $\sim 30$), while replacing PUCT with $\epsilon$-greedy reuse closes most of the reuse-side gap ($-51.4\%$ at seed 42). Removing reuse entirely is the most costly single-axis change: $-13.3\%$. Naive-RL (expected reward + no reuse) and Best-of-$N$ on the frozen policy never produce a single functionally correct design within the $3200$-rollout budget. The reuse-vs-$\epsilon$-greedy gap is small enough that we read it as “PUCT helps but is not the sole driver”; the entropic-vs-expected gap is larger and consistent with the observation of Yuksekgonul et al. ([2026](https://arxiv.org/html/2606.05253#bib.bib1)) that exploration matters most outside kernel-engineering settings, where uniform sampling from the base policy almost never yields a correct Verilog module.
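The two non-PUCT reuse settings are simple enough to sketch. The following Python is a minimal illustration, assuming that the pool stores `(design_id, reward)` pairs, that “greedy” means picking the highest-reward stored design, and that the empty root is represented by `None`; none of these details are confirmed by this excerpt, and `select_parent` is a hypothetical name. PUCT is not sketched because its scoring rule is defined elsewhere in the paper.

```python
import random

def select_parent(pool, mode, rng, epsilon=0.1):
    """Pick the design a new rollout starts from.

    pool: list of (design_id, reward) for previously verified designs.
    mode: "none" restarts from the empty root; "eps_greedy" reuses the pool.
    Returns a design_id, or None for the empty root.
    """
    if mode == "none" or not pool:
        return None
    if mode == "eps_greedy":
        if rng.random() < epsilon:
            # Explore: uniform draw over everything in the pool.
            return rng.choice(pool)[0]
        # Exploit: best design found so far.
        return max(pool, key=lambda state: state[1])[0]
    raise ValueError(f"unknown reuse mode: {mode}")

# Hypothetical pool contents for illustration only.
rng = random.Random(0)
pool = [("a", 11.1), ("b", 18.0), ("c", 14.2)]
parent = select_parent(pool, "eps_greedy", rng, epsilon=0.1)
```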

#### $\delta$-schedule axis.

On top of the framework's $-45.3\%$, the adaptive $\delta$-controller adds another $\sim 14$ pp at seed $42$ ($-59.4\%$). The cosine $1.1 \to 0.3$ schedule, hand-tuned to approximate “explore early, exploit late,” recovers most of the gap ($-49.8\%$); the reverse cosine is uniformly worse. This is consistent with the controller firing P4 (over-exploring, grow $\delta$) when advantages are diffuse early on, and P2/P3 (winner-take-all / stagnation, shrink $\delta$) once a winner emerges; that is, the trajectory resembles the explore-then-exploit shape that the cosine encodes by hand. Per-step KL-budget trajectories appear in [Appendix C](https://arxiv.org/html/2606.05253#A3). A four-seed paired replication on LZA `simd_half` and a single-seed case study on a second C910 unit ([Appendix D](https://arxiv.org/html/2606.05253#A4)) confirm task-responsive direction and a $\sim 2.6\times$ reduction in seed-wise standard deviation, but the gap does not reach $p < 0.05$ at $n = 4$; we discuss this caveat further in [Appendix D](https://arxiv.org/html/2606.05253#A4).
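For reference, a hand-written cosine schedule of the kind named above can be sketched as follows. This assumes the common half-cosine interpolation between the two endpoints over the $100$ steps; the exact parameterisation used in the experiments is not given in this excerpt, and `cosine_delta` is a hypothetical helper.

```python
import math

def cosine_delta(step, total_steps, start, end):
    """Half-cosine interpolation from `start` (step 0) to `end` (last step)."""
    if total_steps <= 1:
        return end
    progress = step / (total_steps - 1)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

# "Explore-then-exploit": large KL budget early, small late.
explore_then_exploit = [cosine_delta(t, 100, 1.1, 0.3) for t in range(100)]
# Reverse schedule used as a sanity check.
reverse = [cosine_delta(t, 100, 0.3, 1.1) for t in range(100)]
```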

#### Mean vs\. max reward\.

[Figure 3](https://arxiv.org/html/2606.05253#S4.F3)b,c isolates the entropic advantage estimator: expected-reward + PUCT lifts the per-step *mean* reward but caps the per-step *max* below TTT-RTL's. The entropic estimator therefore trades a slightly worse average rollout for a consistently better best rollout, which is the quantity that matters in a discovery setting.
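The three trajectory panels referred to here (best-so-far, per-step mean, per-step max) can be computed from per-step rollout rewards as in the short Python sketch below. The input layout (one list of rewards per RL step) and the function name are assumptions made for illustration; the toy numbers are not experimental data.

```python
def panel_curves(rewards_by_step):
    """Return (running_max, step_mean, step_max) for a list of per-step reward lists."""
    running_max, step_mean, step_max = [], [], []
    best_so_far = float("-inf")
    for rollouts in rewards_by_step:
        current_max = max(rollouts)
        best_so_far = max(best_so_far, current_max)
        running_max.append(best_so_far)                 # panel (a): max reward up to step
        step_mean.append(sum(rollouts) / len(rollouts))  # panel (b): typical rollout
        step_max.append(current_max)                    # panel (c): best rollout in step
    return running_max, step_mean, step_max

# Toy input: three steps with two rollouts each.
curves = panel_curves([[0.0, 2.0], [1.0, 5.0], [3.0, 4.0]])
```

In the toy input the last step has the highest mean but not the highest max, which is the kind of mean/max divergence the comparison above is about.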

## 5 Conclusion

We presented TTT-RTL, a framework that closes the loop between an LLM policy and EDA synthesis tools to apply test-time training to RTL optimization. On RTLLM v2.0 under Nangate 45nm it reduces the geometric-mean PPA product by 65.1% relative to the reference, compared with 26.1% for the strongest published frozen-policy agent baseline. On a XuanTie C910 industrial LZA unit under Sky130 it achieves a 59.4% ADP reduction over well-tuned production RTL, while component ablations isolate the contribution of the PUCT state pool and the entropic advantage estimator.

## References

- T. Ajayi, V. A. Chhabria, M. Fogaça, S. Hashemi, A. Hosny, A. B. Kahng, M. Kim, J. Lee, U. Mallappa, M. Neseem, G. Pradipta, S. Reda, M. Saligane, S. S. Sapatnekar, C. Sechen, M. Shalan, W. Swartz, L. Wang, M. Woo, and B. Xu (2019) Toward an open-source digital flow: first learnings from the OpenROAD project. In Proceedings of the 56th Annual Design Automation Conference (DAC). External Links: [Document](https://dx.doi.org/10.1145/3316781.3326334).
- Chip-chat: challenges and opportunities in conversational hardware design. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2305.13243), [Link](https://arxiv.org/abs/2305.13243), 2305.13243.
- C. Chen, X. Xiang, C. Liu, Y. Shang, R. Guo, D. Liu, Y. Lu, Z. Hao, J. Luo, Z. Chen, C. Li, Y. Pu, J. Meng, X. Yan, Y. Xie, and X. Qi (2020) XuanTie-910: a commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension. In Proceedings of the 47th International Symposium on Computer Architecture (ISCA), pp. 52–64. External Links: [Document](https://dx.doi.org/10.1109/ISCA45697.2020.00016).
- Z. Chen, K. Chang, Z. Li, C. Li, X. He, C. Chen, M. Wang, H. Xu, Y. Han, H. Li, and Y. Wang (2025) ChipSeek: optimizing Verilog generation via EDA-integrated reinforcement learning. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2507.04736), [Link](https://arxiv.org/abs/2507.04736), 2507.04736.
- W. Hsin, R. Deng, Y. Hsieh, E. Huang, and S. Hung (2026) EvolVE: evolutionary search for LLM-based Verilog generation and optimization. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2601.18067), [Link](https://arxiv.org/abs/2601.18067), 2601.18067.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). External Links: [Document](https://dx.doi.org/10.1145/3600006.3613165), [Link](https://arxiv.org/abs/2309.06180), 2309.06180.
- H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi (2022) CodeRL: mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems 35 (NeurIPS). External Links: [Link](https://arxiv.org/abs/2207.01780), 2207.01780.
- M. Liu, T. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktaroglu, B. Catanzaro, A. Chaudhuri, B. Khailany, H. Ren, et al. (2023a) ChipNeMo: domain-adapted LLMs for chip design. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2311.00176), [Link](https://arxiv.org/abs/2311.00176), 2311.00176.
- M. Liu, N. Pinckney, B. Khailany, and H. Ren (2023b) VerilogEval: evaluating large language models for verilog code generation. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2309.07544), [Link](https://arxiv.org/abs/2309.07544), 2309.07544.
- S. Liu, W. Fang, Y. Lu, Q. Zhang, H. Zhang, and Z. Xie (2023c) RTLCoder: outperforming GPT-3.5 in design RTL generation with our open-source dataset and lightweight solution. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2312.08617), [Link](https://arxiv.org/abs/2312.08617), 2312.08617.
- S. Liu, Y. Lu, W. Fang, M. Li, and Z. Xie (2025) OpenLLM-RTL: open dataset and benchmark for LLM-aided design RTL generation. arXiv. External Links: [Link](https://arxiv.org/abs/2503.15112), 2503.15112.
- Y. Lu, S. Liu, Q. Zhang, and Z. Xie (2024) RTLLM: an open-source benchmark for design RTL generation with large language model. In Proceedings of the 29th Asia and South Pacific Design Automation Conference (ASP-DAC). External Links: [Document](https://dx.doi.org/10.1109/ASP-DAC58780.2024.10473904), [Link](https://arxiv.org/abs/2308.05345), 2308.05345.
- K. Min, K. Cho, J. Jang, and S. Kang (2025) REvolution: an evolutionary framework for RTL generation driven by large language models. arXiv. Note: Accepted to ASP-DAC 2026. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2510.21407), [Link](https://arxiv.org/abs/2510.21407), 2510.21407.
- Z. Pei, H. Zhen, M. Yuan, Y. Huang, and B. Yu (2024) BetterV: controlled Verilog generation with discriminative guidance. In Proceedings of the 41st International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 235. External Links: [Link](https://arxiv.org/abs/2402.03375), 2402.03375.
- H. Ping, P. Zhang, S. Li, W. Yang, A. Cheng, S. Duan, X. Zhang, and P. Bogdan (2026) COEVO: co-evolutionary framework for joint functional correctness and PPA optimization in LLM-based RTL generation. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2604.15001), [Link](https://arxiv.org/abs/2604.15001), 2604.15001.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: [Link](http://arxiv.org/abs/1707.06347), 1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2402.03300), [Link](https://arxiv.org/abs/2402.03300), 2402.03300.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024) HybridFlow: a flexible and efficient RLHF framework. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2409.19256), [Link](https://arxiv.org/abs/2409.19256), 2409.19256.
- J. Shi, Z. Gao, C. Ko, and D. Boning (2025) EARL: entropy-aware RL alignment of LLMs for reliable RTL code generation. arXiv. External Links: [Link](https://arxiv.org/abs/2511.12033), 2511.12033.
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2408.03314), [Link](https://arxiv.org/abs/2408.03314), 2408.03314.
- Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 9229–9248.
- Y. Wang, Q. Shi, S. Li, Q. Hu, X. Yin, B. Guo, X. Han, M. Sun, and J. Su (2026) VeriAgent: a tool-integrated multi-agent system with evolving memory for PPA-aware RTL code generation. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2603.17613), [Link](https://arxiv.org/abs/2603.17613), 2603.17613.
- C. Wolf and J. Glaser (2013) Yosys – a free verilog synthesis suite. In Proceedings of the 21st Austrian Workshop on Microelectronics (Austrochip). Note: Post-2014, the first author goes by Claire Wolf.
- Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang (2024) Inference scaling laws: an empirical analysis of compute-optimal inference for problem-solving with language models. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2408.00724), [Link](https://arxiv.org/abs/2408.00724), 2408.00724.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, J. Lin, J. Zhou, et al. (2025) Qwen3 technical report. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2505.09388), [Link](https://arxiv.org/abs/2505.09388), 2505.09388.
- M. Yuksekgonul, D. Koceja, X. Li, F. Bianchi, J. McCaleb, X. Wang, J. Kautz, Y. Choi, J. Zou, C. Guestrin, and Y. Sun (2026) Learning to discover at test time. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2601.16175), [Link](https://arxiv.org/abs/2601.16175), 2601.16175.
- Y. Zhu and contributors (2025) CodeV-R1: a verilog reasoning-distillation dataset. Note: Hugging Face dataset. External Links: [Link](https://huggingface.co/datasets/zhuyaoyu/CodeV-R1-dataset).
- Y. Zuo, K. Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua, B. Qi, Y. Sun, Z. Ma, L. Yuan, N. Ding, and B. Zhou (2025) TTRL: test-time reinforcement learning. In Advances in Neural Information Processing Systems 38 (NeurIPS). External Links: [Link](https://arxiv.org/abs/2504.16084), 2504.16084.

## Appendix A Per-design Results on RTLLM v2.0

[Table 4](https://arxiv.org/html/2606.05253#A1.T4) reports the per-design ratio of every method on every RTLLM v2.0 problem (all $49$ designs). “–” indicates that the method failed to produce a compilable / functionally correct design within its reported budget. The TTT-RTL column is the PPA-product ratio ($A \cdot D \cdot P$) that drove its training reward; the EvolVE / VeriAgent / REvolution columns are the PPA-product ratios reported by the original authors against the same v2.0 reference (see “Caveat: external systems comparison, not same-backbone isolation” in [Section 4.2](https://arxiv.org/html/2606.05253#S4.SS2)). The summary statistics ([Table 1](https://arxiv.org/html/2606.05253#S4.T1) in the main text) are computed over the $N = 44$ common subset where all four methods completed (the intersection required for a fair GeoMean); the five remaining designs (`asyn_fifo`, `radix2_div`, `freq_divbyeven`, `freq_divbyfrac`, `freq_divbyodd`) are excluded from GeoMean / #Best but are shown in the per-design table below for completeness.
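The common-subset protocol described above can be sketched in a few lines of Python: keep only designs on which every method produced a ratio, then take the geometric mean per method. The data layout (a nested dict with `None` for failures), the function names, and the numbers are hypothetical and chosen only to illustrate the computation.

```python
import math

def common_subset(ratios):
    """Designs for which every method reports a ratio (None marks a failure)."""
    per_method = [
        {design for design, ratio in results.items() if ratio is not None}
        for results in ratios.values()
    ]
    return sorted(set.intersection(*per_method))

def geomean(values):
    """Geometric mean of strictly positive ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-design PPA-product ratios against a shared reference.
ratios = {
    "method_1": {"d1": 0.5, "d2": 2.0, "d3": None},
    "method_2": {"d1": 0.25, "d2": 1.0, "d3": 0.9},
}
subset = common_subset(ratios)
summary = {m: geomean([r[d] for d in subset]) for m, r in ratios.items()}
below_reference = {m: sum(r[d] < 1.0 for d in subset) for m, r in ratios.items()}
```

Restricting to the intersection keeps the geometric means comparable, since each method is averaged over exactly the same designs.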

Table 4: Per-design ratio on RTLLM v2.0 (lower is better; bold = best per row; “–” = method failed to produce compilable RTL). All four methods report the PPA-product ratio $A{\cdot}D{\cdot}P$ against the same v2.0 reference; EvolVE / VeriAgent / REvolution numbers are taken from Ping et al. [[2026](https://arxiv.org/html/2606.05253#bib.bib23)], Table 3 (GPT-4o-mini).

## Appendix B Flow Sanity Check: Local vs. Published Reference PPA

To support the ratio-normalized comparison protocol described in [Section 4.2](https://arxiv.org/html/2606.05253#S4.SS2) (“Flow sanity check and ratio-normalized comparison”), we re-evaluated the RTLLM v2.0 reference Verilog on every design under our local Yosys + OpenSTA flow with the Nangate 45 nm typical-corner library, and compared the resulting area / delay / power against the published reference numbers reported in Table 3 of Ping et al. [[2026](https://arxiv.org/html/2606.05253#bib.bib23)] (the open-source pipeline released as `hping666/COEVO`). [Table 5](https://arxiv.org/html/2606.05253#A2.T5) shows representative rows spanning the four complexity bins of [Table 2](https://arxiv.org/html/2606.05253#S4.T2); [Table 6](https://arxiv.org/html/2606.05253#A2.T6) reports aggregate error statistics over the designs that synthesize cleanly under both flows ($47$ for area, $43$ for delay/power; see the table caption).

Table 5: Representative per-design comparison of the v2.0 reference RTL under our local flow vs. the published numbers [Ping et al., [2026](https://arxiv.org/html/2606.05253#bib.bib23)]. Area in $\mu\mathrm{m}^{2}$, delay in ns, power in $\mu$W; the $\Delta$ columns are signed relative error $(\text{ours} - \text{pub})/\text{pub}$. Area and delay are closely matched on the vast majority of designs; absolute power shows larger discrepancies driven by OpenSTA activity / reporting assumptions, which is the reason we only report ratios against the shared reference in the main comparison. Two outlier-style cases (`RAM` and `asyn_fifo`, where Yosys infers different array / synchronizer structures across flows) are included for completeness.

| Design | $A_{\text{pub}}$ | $A_{\text{ours}}$ | $\Delta A$ | $D_{\text{pub}}$ | $D_{\text{ours}}$ | $\Delta D$ | $P_{\text{pub}}$ | $P_{\text{ours}}$ | $\Delta P$ |
|---|---|---|---|---|---|---|---|---|---|
| `comparator_3bit` | 12.0 | 12.0 | $+0.0\%$ | 0.15 | 0.16 | $+6.7\%$ | 5.7 | 4.7 | $-18.3\%$ |
| `adder_8bit` | 48.9 | 48.9 | $+0.0\%$ | 0.31 | 0.34 | $+9.7\%$ | 29.0 | 16.8 | $-42.1\%$ |
| `calendar` | 164.1 | 164.9 | $+0.5\%$ | 0.42 | 0.47 | $+11.9\%$ | 136.0 | 105.0 | $-22.8\%$ |
| `adder_16bit` | 96.8 | 94.7 | $-2.2\%$ | 0.59 | 0.67 | $+13.6\%$ | 61.5 | 40.4 | $-34.3\%$ |
| `adder_32bit` | 208.5 | 211.2 | $+1.3\%$ | 0.87 | 0.91 | $+4.6\%$ | 138.0 | 85.1 | $-38.3\%$ |
| `alu` | 1953.0 | 1967.1 | $+0.7\%$ | 1.93 | 2.10 | $+8.8\%$ | 847.0 | 688.0 | $-18.8\%$ |
| `multi_16bit` | 951.5 | 933.9 | $-1.8\%$ | 1.98 | 2.19 | $+10.6\%$ | 1250.0 | 796.0 | $-36.3\%$ |
| `adder_pipe_64bit` | 2529.4 | 2523.5 | $-0.2\%$ | 0.83 | 0.90 | $+8.4\%$ | 2910.0 | 1920.0 | $-34.0\%$ |
| `pe` | 3649.5 | 3640.2 | $-0.3\%$ | 1.72 | 1.91 | $+11.0\%$ | 35700.0 | 1400.0 | $-96.1\%$ |
| *Outlier-style cases (different inferred structure across flows)* | | | | | | | | | |
| `RAM` | 474.8 | 633.9 | $+33.5\%$ | 0.22 | 0.32 | $+45.5\%$ | 607.0 | 479.0 | $-21.1\%$ |
| `asyn_fifo` | 1099.9 | 1341.7 | $+22.0\%$ | 0.33 | 0.87 | $+163.6\%$ | 1600.0 | 1140.0 | $-28.7\%$ |

Table 6: Aggregate flow-sanity statistics over the RTLLM v2.0 designs that synthesize cleanly under both flows. The *Area* row covers all $47$ designs that complete area synthesis under both flows; the *Delay* and *Power* rows are restricted to the $43$-design subset for which both flows also produce a valid OpenSTA timing report (designs where one flow returns a delay-floor or missing-power report are excluded from those two rows only). Error is signed relative error of our local measurement against the published value; *within-$X\%$* counts designs whose absolute relative error is at most $X$. The takeaway is that the PPA product's area and delay components are tightly matched, and the residual cross-flow discrepancy lives almost entirely in absolute power, which the main-result protocol absorbs by reporting only ratios against the shared v2.0 reference. The full per-design CSV (`reference_ppa_measured.csv`) is released alongside the codebase.

## Appendix C KL-Budget Ablation Trajectories

![Refer to caption](https://arxiv.org/html/2606.05253v1/x4.png)

Figure 4: KL-budget ablation trajectories on `ct_vfmau_lza_simd_half`, in the same three-panel layout as [Figure 3](https://arxiv.org/html/2606.05253#S4.F3). Adaptive $\delta$ (dark blue, RTL-Discover) keeps a strictly higher upper-tail reward than all three alternatives throughout training (panel c) and is the only strategy whose best-so-far reward (panel a) crosses $r = 20$. The cosine $1.1 \to 0.3$ schedule recovers a substantial fraction of the gap, supporting the interpretation that early-large / late-small $\delta$ is a useful inductive bias for RTL search; the reverse schedule is uniformly worse. The fixed Discover-style $\delta = \ln 2$ falls in between but plateaus before reaching the reference-ADP line on (a). Headline numbers are in [Table 3](https://arxiv.org/html/2606.05253#S4.T3) ([Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3)).

## Appendix D Multi-Seed and Multi-Task Replication of the Adaptive KL-Budget Controller

[Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3) reports the adaptive vs. fixed $\delta = \ln 2$ contrast at a single seed on `ct_vfmau_lza_simd_half`. We extend that contrast in two directions: (i) a four-seed paired replication on the same problem, and (ii) a single-seed case study on a second C910 unit, `ct_vfdsu_fadd_close_s0_d`. Both arms share the recipe of [Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3) (PUCT + entropic, sky130, KL penalty active, $100$ RL steps, $4$ parents $\times$ $8$ rollouts) and differ only in the algorithmic switch `algorithm.ttt_entropic_kl_budget_mode`. Raw extracts, statistics scripts, and figures are released under `experiment/checklist/multi-seed-c910/`.

#### LZA `simd_half` ($n = 4$ paired).

[Table 7](https://arxiv.org/html/2606.05253#A4.T7) shows the four headline metrics paired across seeds. No metric reaches $p < 0.05$ under either paired $t$-test or Wilcoxon at $n = 4$; the unpaired Welch test on `best_reward_ever` returns $p = 0.51$. The mean advantage of $+0.67$ on `best_reward_ever` is robust to dropping any single seed (leave-one-out $\Delta \in [+0.12, +0.95]$, all positive), but the gap remains within seed-wise noise. Adaptive does, however, reduce seed-wise standard deviation by $\sim 2.6\times$ ($0.56$ vs. $1.48$) and reaches score $\geq 20$ on $4/4$ seeds while fixed reaches it on $3/4$. Across all four adaptive seeds the controller drives $\delta$ *below* $\ln 2$, with per-seed minima in $[0.17, 0.50]$ ([Figure 5](https://arxiv.org/html/2606.05253#A4.F5), top).
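To illustrate the kind of analysis reported here, the sketch below runs a paired $t$-test and a leave-one-out check for $n = 4$ pairs using only the Python standard library. It relies on the closed-form CDF of Student's $t$ with $3$ degrees of freedom, so it is valid for exactly four pairs. The per-seed values are made-up placeholders, not the paper's data (the released statistics scripts are the authoritative source), and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences a_i - b_i (df = n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def two_sided_p_df3(t):
    """Two-sided p-value for Student's t with exactly 3 degrees of freedom."""
    x = abs(t) / math.sqrt(3.0)
    cdf = 0.5 + (math.atan(x) + x / (1.0 + x * x)) / math.pi
    return 2.0 * (1.0 - cdf)

def leave_one_out_deltas(a, b):
    """Mean paired difference after dropping each seed in turn."""
    diffs = [x - y for x, y in zip(a, b)]
    return [mean(diffs[:i] + diffs[i + 1:]) for i in range(len(diffs))]

# Placeholder per-seed peak rewards (illustrative only).
adaptive = [10.0, 11.0, 12.0, 13.0]
fixed = [9.5, 11.5, 11.0, 12.0]

t_stat = paired_t_statistic(adaptive, fixed)
p_value = two_sided_p_df3(t_stat)
loo = leave_one_out_deltas(adaptive, fixed)
```

With only three degrees of freedom the critical $|t|$ for $p < 0.05$ is about $3.18$, which is why a consistently positive but noisy difference across four seeds does not reach significance.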

Table 7: Paired multi-seed comparison on `ct_vfmau_lza_simd_half` (sky130, $n = 4$ adaptive vs. $4$ fixed, $100$ RL steps each). Mean $\pm$ std across seeds; $p$-values are paired $t$-test. No metric reaches $p < 0.05$.

![Refer to caption](https://arxiv.org/html/2606.05253v1/figures/multiseed_c910/fig1_delta_trajectory.png)

![Refer to caption](https://arxiv.org/html/2606.05253v1/figures/multiseed_c910/fig2_per_seed_peak.png)

Figure 5: LZA `simd_half`, $n = 4$ paired. Left: per-step $\delta_{t}$ for all four adaptive seeds and the fixed $\delta = \ln 2$ baseline. Adaptive consistently drives $\delta$ below $\ln 2$ across seeds. Right: per-seed `best_reward_ever` bars (paired). Adaptive wins $3/4$ but the gap is within seed-wise noise. The four adaptive runs were not originally generated as seeds $1$–$4$; they are historical runs treated as a four-seed pool for this analysis. An exhaustive enumeration of all $24$ adaptive-to-fixed permutations yields paired-$t$ $p \in [0.31, 0.59]$ with $0/24$ significant at $p < 0.10$, so the headline conclusion is invariant to pairing choice.

#### `fadd_close_s0_d` ($n = 1$).

A single-seed case study on a second C910 unit shows a qualitatively different picture: the fixed arm plateaus at `best_reward` $= 9.86$ from step $\sim 20$ onwards and never crosses $10$, while the adaptive arm reaches $12.55$ (peak) and $12.33$ (end-of-run mean over the last $20$ steps), a $+27\%$ / $+34\%$ relative gap. Crucially, the controller *raises* $\delta$ above $\ln 2$ on this problem (to $1.44$, see [Figure 6](https://arxiv.org/html/2606.05253#A4.F6), right), the opposite direction from LZA, demonstrating that the four EMA signals of [Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4) respond to task-specific structure rather than moving in a fixed direction.
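As a quick arithmetic check of the two relative gaps, the peak values above can be combined with the $+3.11$ end-of-run difference reported in the summary later in this appendix. The implied fixed end-of-run mean of $9.22$ is derived here and is not stated in the text.

```latex
\[
\frac{12.55 - 9.86}{9.86} = \frac{2.69}{9.86} \approx 0.273 \;(+27\%),
\qquad
\frac{3.11}{12.33 - 3.11} = \frac{3.11}{9.22} \approx 0.337 \;(+34\%).
\]
```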

![Refer to caption](https://arxiv.org/html/2606.05253v1/figures/multiseed_c910/fig2_running_max_score.png)

![Refer to caption](https://arxiv.org/html/2606.05253v1/figures/multiseed_c910/fig3_delta_trajectory.png)

Figure 6: `fadd_close_s0_d`, $n = 1$ vs. $1$. Left: running-max best-in-batch reward. Fixed plateaus at $9.86$; adaptive escapes around step $30$ and stabilises near $12.5$. Right: $\delta_{t}$ trajectory. Fixed stays at $\ln 2 \approx 0.693$; adaptive raises $\delta$ to $1.44$ at step $\sim 50$, *after* the policy breakthrough.

#### What this experiment supports.

\(1\) The adaptive controller is task\-responsive in direction – it lowersδ\\deltaon LZA and raises it on fadd\. \(2\) Adaptive matches or exceeds fixed on every aggregate metric we report \(\+0\.67\+0\.67peak and−25\-25steps to score≥20\\geq 20on LZA;\+2\.69\+2\.69peak and\+3\.11\+3\.11end\-of\-run on fadd\)\. \(3\) Seed\-wise variance on LZA is reduced∼2\.6×\\sim\\\!2\.6\\timesand the worst\-case fixed seed \(peak18\.7518\.75, fails to cross score2020\) has no adaptive counterpart\.

#### What this experiment does*not*support\.

\(1\) Statistical significance: no metric reachesp<0\.05p<0\.05on LZA atn=4n=4, and fadd isn=1n=1and admits no test\. \(2\) Causal attribution on fadd: theδ\\deltaraise occurs at step∼50\\sim\\\!50, after the policy has already plateaued at the new peak near step3030\. In the first5050steps both arms run withδ=ln⁡2\\delta=\\ln 2and the divergence is from PUCT / vLLM stochasticity, not the scheduler\. \(3\) A peak\-performance claim: the LZA\+0\.67\+0\.67falls inside the\[15\.6,25\.7\]\[15\.6,25\.7\]development\-time spread reported in[Section˜5](https://arxiv.org/html/2606.05253#S5), and fadd is a single trial\.

We accordingly frame the adaptive controller in the body of the paper as a robustness and task-responsiveness observation rather than a peak-performance win. Re-running fadd at $n\geq 4$ and extending LZA to $n\geq 8$ remain the obvious follow-ups.

## Appendix E C910 LZA Case Study: Baseline, Optimized Code, and Testbench

This appendix documents the concrete artifact behind the C910 numbers in [Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3) and [Appendix D](https://arxiv.org/html/2606.05253#A4): the baseline RTL we start from, the best discovered candidate (`best_reward_ever` $=25.72$, $A\cdot D=1.38\times 10^{6}$ vs. baseline $3.40\times 10^{6}$, i.e. $-59.4\%$ ADP at sky130 HD), and the verification harness that gates every rollout.

### E.1 Baseline: Xuantie OpenC910 `ct_vfmau_lza_simd_half`

The baseline is the upstream RTL from the open-sourced Xuantie C910 core (`ct_vfmau_lza_simd_half.v`), which implements a $24$-bit leading-zero anticipator for the SIMD half-precision lane of the VFMAU. The module has two stages:

#### Stage 1: pre\-encode\.

Carry signals $P=\mathrm{summand}\oplus\mathrm{addend}$, $G=\mathrm{summand}\,\&\,\mathrm{addend}$, $D=\overline{\mathrm{summand}\,|\,\mathrm{addend}}$ are formed bitwise, followed by a vectorized $24$-bit pre-decode `lza_precod[23:0]` with three boundary cases (LSB, MSB, and the bulk $[22{:}1]$ slice gated on `sub_vld`). Both arms keep this stage byte-for-byte identical; the optimization budget is entirely in stage 2.
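As a minimal illustration of the three stage-1 carry signals, the Python sketch below computes $P$, $G$ and $D$ for 24-bit operands. The function name `carry_signals` and the constants `WIDTH` and `MASK` are hypothetical; the pre-decode `lza_precod` is not reproduced because its boundary cases are not spelled out in this appendix.

```python
WIDTH = 24                      # operand width of the SIMD half-precision lane
MASK = (1 << WIDTH) - 1         # keeps every result inside 24 bits


def carry_signals(summand: int, addend: int) -> tuple[int, int, int]:
    """Return the bitwise propagate, generate and delete vectors."""
    p = (summand ^ addend) & MASK        # P = summand XOR addend
    g = (summand & addend) & MASK        # G = summand AND addend
    d = ~(summand | addend) & MASK       # D = NOT (summand OR addend)
    return p, g, d


p, g, d = carry_signals(0xAAAAAA, 0x555555)
```

Every bit position falls into exactly one of the three classes, so `p | g | d` equals `MASK` and the pairwise ANDs are zero. The alternating pattern above is the all-propagate case from the edge-case phase of the harness.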

#### Stage 2 (baseline): flat $24$-way `casez` priority encoder.

The leading-one position is decoded by a single `always @(*) casez` block with $24$ one-hot patterns of the form `24'b1???…` $\to$ `5'd0`, `24'b01??…` $\to$ `5'd1`, $\dots$, `24'b…001` $\to$ `5'd23`, plus a default `5'd24`. Yosys synthesizes this into a $24$-input priority chain whose critical path traverses all $24$ `casez` arms before the $5$-bit encoder; together with the (already large) pre-encode this is what gives the baseline ADP of $3.40\times 10^{6}$.

### E\.2Best discovered candidate

The best rollout under our adaptive-$\delta$ recipe keeps stage 1 unchanged and rewrites stage 2 as a *hierarchical $4$-bit grouped* priority encoder (Listing 1). Three changes drive the ADP reduction:

1. Group-then-encode. `lza_precod[23:0]` is sliced into six $4$-bit groups ($\{23{:}20\},\{19{:}16\},\dots,\{3{:}0\}$). Within a group the $4$-way priority over a $4$-bit one-hot vector collapses to a depth-$3$ ternary cascade `a[3] ? k : a[2] ? k+1 : a[1] ? k+2 : a[0] ? k+3 : 5'd24`, which Yosys maps to four $2$-input gates per group instead of the shared $24$-priority chain.
2. Group-validity reuse. Six instances of the existing C910 sub-cell `ct_vfmau_lza_42` (a 4:2 LZA compressor already shipped in the `deps/` directory) compute a `lza_vld` flag per group; the outer cascade picks the highest-index valid group, so each $4$-bit decoder fires only when its group is selected. The `lza_p0`/`lza_p1` outputs of the compressor are deliberately left unconnected; the policy reuses the cell only for its valid bit. This converts the baseline's monolithic priority chain into a $\log_{4}(24)$-style two-level structure, which is the dominant source of the delay reduction ($2191\to 1315\,\mathrm{ps}$, $-40\%$).
3. Combinational, not registered. The baseline declares `reg [4:0] lza_result` and drives it from an `always @(*)` block; the candidate uses a single `assign` with a nested ternary, which is functionally equivalent (no clock anywhere in the spec) but lets Yosys schedule the decoder under one cone of logic with the group-validity lookups, trimming both area ($1553\to 1051\,\mu\mathrm{m}^{2}$, $-32\%$) and the fan-in to the final mux.

The interface \(port list, widths, directions\) is preserved exactly, so the candidate is a drop\-in replacement\. This module\-internal restructuring is the only change in stage 2\.

Listing 1: Stage 2 of the best candidate (`ttt_c910_state_pool/latest.json`, state #304, $A\cdot D=1.38\times 10^{6}$). Stage 1 (`carry_p/g/d`, `lza_precod`) is unchanged from the baseline and elided.

    wire [3:0] lza_4_23_20 = lza_precod[23:20];
    wire [3:0] lza_4_19_16 = lza_precod[19:16];
    wire [3:0] lza_4_15_12 = lza_precod[15:12];
    wire [3:0] lza_4_11_8  = lza_precod[11:8];
    wire [3:0] lza_4_7_4   = lza_precod[7:4];
    wire [3:0] lza_4_3_0   = lza_precod[3:0];

    ct_vfmau_lza_42 lza_42_23_20 (.lza_precod(lza_4_23_20),
                                  .lza_p0(), .lza_p1(), .lza_vld());
    // ...

    assign lza_result =
      (lza_42_23_20.lza_vld) ?
        (lza_4_23_20[3] ? 5'd0 : lza_4_23_20[2] ? 5'd1 :
         lza_4_23_20[1] ? 5'd2 : lza_4_23_20[0] ? 5'd3 : 5'd24) :
      (lza_42_19_16.lza_vld) ?
        (lza_4_19_16[3] ? 5'd4 : lza_4_19_16[2] ? 5'd5 :
         lza_4_19_16[1] ? 5'd6 : lza_4_19_16[0] ? 5'd7 : 5'd24) :
      // ...
      (lza_42_3_0.lza_vld) ?
        (lza_4_3_0[3] ? 5'd20 : lza_4_3_0[2] ? 5'd21 :
         lza_4_3_0[1] ? 5'd22 : lza_4_3_0[0] ? 5'd23 : 5'd24) :
      5'd24;

    assign lza_result_zero = ~|lza_precod[23:0];
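The stage-2 restructuring can be sanity-checked against a small behavioural model. The Python sketch below is not the authors' RTL: `flat_encoder` mirrors the baseline `casez` priority order, `grouped_encoder` mirrors the two-level cascade, and the group-valid bit is assumed to be the OR of the four group bits (the internal logic of `ct_vfmau_lza_42` is not shown in this appendix). All function names are hypothetical.

```python
import random

WIDTH = 24
NO_ONE = 24  # default code when no bit is set (5'd24 in the RTL)


def flat_encoder(precod: int) -> int:
    """Baseline model: scan from bit 23 downwards, return the leading-one position."""
    for pos in range(WIDTH):
        if (precod >> (WIDTH - 1 - pos)) & 1:
            return pos
    return NO_ONE


def grouped_encoder(precod: int) -> int:
    """Candidate model: pick the first valid 4-bit group, then decode inside it."""
    for group in range(WIDTH // 4):          # group 0 covers bits 23:20
        shift = WIDTH - 4 * (group + 1)
        nibble = (precod >> shift) & 0xF
        if nibble == 0:                      # ASSUMPTION: vld = OR of the group bits
            continue
        base = 4 * group                     # k in the ternary cascade
        for offset in range(4):              # a[3] ? k : a[2] ? k+1 : ...
            if (nibble >> (3 - offset)) & 1:
                return base + offset
    return NO_ONE


def check_equivalence(trials: int = 10_000, seed: int = 0) -> bool:
    """Compare both models on zero, walking-1 and random 24-bit inputs."""
    rng = random.Random(seed)
    vectors = [0] + [1 << i for i in range(WIDTH)]
    vectors += [rng.getrandbits(WIDTH) for _ in range(trials)]
    return all(flat_encoder(v) == grouped_encoder(v) for v in vectors)
```

Under the stated validity assumption the two models agree on every input, which is the behavioural property the verification harness checks in simulation.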

### E.3 Verification harness

Every rollout is gated by an `iverilog` testbench that co-instantiates the candidate and a renamed copy of the baseline (`ct_vfmau_lza_simd_half_ref`) and compares `lza_result` and `lza_result_zero` cycle by cycle. The harness contains $5$ phases ($\sim\!1{,}050$ vectors total):

- Edge cases (`sub_vld = 0`, addition mode). Both operands zero, both all-ones, MSB-only and LSB-only patterns, one operand zero with the other all-ones, alternating `0xAAAAAA`/`0x555555`, walking-1 over all $24$ bit positions on each operand, adjacent-bit carry-propagate patterns, and the three saturated carry chains $P=\mathbf{1}$, $G=\mathbf{1}$, $D=\mathbf{1}$.
- Edge cases (`sub_vld = 1`, subtraction mode). The same zero / all-ones / boundary / walking-1 / alternating set, exercising the `sub_vld`-dependent branches of `lza_precod[0]` and `lza_precod[23]`.
- Random, $\mathbf{800}$ vectors. $400$ with `sub_vld = 0` and $400$ with `sub_vld = 1`, drawn from `$random(seed)` with seed $42$.
- Sparse / mixed. $100$ vectors with `sub_vld` sampled per trial and only $2$–$3$ random bits set across the two operands, to stress the priority encoder under near-degenerate inputs.
- Close-path. $100$ vectors with `addend` = `summand` XOR-perturbed at one random bit position under `sub_vld = 1`; this is the regime where leading-zero anticipation dominates the floating-point add-round critical path and is the original engineering motivation for the LZA cell.

A run is accepted only when all $\sim\!1{,}050$ comparisons pass (`fail_count` $=0$, “Your Design Passed”); any single mismatch zeroes the functional reward and the candidate is rejected before PPA synthesis is even invoked. This is what makes the $-59.4\%$ ADP claim a functional-equivalence claim against the upstream Xuantie cell rather than a synthesis-only PPA claim.
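The random, sparse and close-path phases can be sketched as vector generators. The Python below is an illustrative stand-in for the `iverilog` testbench, not a port of it: Python's `random` module replaces `$random(seed)`, the two edge-case phases are omitted, and every function name is hypothetical.

```python
import random

WIDTH = 24
MASK = (1 << WIDTH) - 1


def random_phase(rng, n_per_mode=400):
    """n_per_mode vectors with sub_vld = 0, then n_per_mode with sub_vld = 1."""
    return [(rng.getrandbits(WIDTH), rng.getrandbits(WIDTH), sub_vld)
            for sub_vld in (0, 1) for _ in range(n_per_mode)]


def sparse_phase(rng, n=100):
    """Only 2-3 random bits set across the two operands; sub_vld sampled per trial."""
    vectors = []
    for _ in range(n):
        summand = addend = 0
        for bit in rng.sample(range(2 * WIDTH), rng.choice((2, 3))):
            if bit < WIDTH:
                summand |= 1 << bit
            else:
                addend |= 1 << (bit - WIDTH)
        vectors.append((summand, addend, rng.randint(0, 1)))
    return vectors


def close_path_phase(rng, n=100):
    """addend equals summand with one random bit flipped, under sub_vld = 1."""
    vectors = []
    for _ in range(n):
        summand = rng.getrandbits(WIDTH)
        vectors.append((summand, summand ^ (1 << rng.randrange(WIDTH)), 1))
    return vectors


def count_failures(reference, candidate, vectors):
    """Number of vectors on which the two models disagree (accept only if 0)."""
    return sum(reference(*v) != candidate(*v) for v in vectors)


rng = random.Random(42)
vectors = random_phase(rng) + sparse_phase(rng) + close_path_phase(rng)
```

`count_failures` plays the role of `fail_count`: a candidate model would be accepted only when it returns zero over the whole vector list.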

## Appendix F Hyperparameter Summary

Table 8: TTT-RTL hyperparameter configuration used in all experiments.

## Appendix G SFT Warm-up Recipe

The `ttt-rtl-sft` checkpoint that initializes the policy model is a lightweight format-and-style warm-up applied to Qwen3-8B *before* any test-time RL. Its sole purpose is to make the base model reliably emit Verilog inside the expected `<think>...</think>` block and ‘‘‘verilog fences while respecting the prompt structure of [Appendix H](https://arxiv.org/html/2606.05253#A8); it is *not* intended as a knowledge distillation step that would short-circuit the role of test-time training.

#### Data source\.

The SFT corpus is $5{,}000$ uniformly sampled rows from the public CoDeV-R1 distillation dataset \[Zhu and contributors, [2025](https://arxiv.org/html/2606.05253#bib.bib36)\] (87K Verilog reasoning-chain records in total), with the original `<think>`/`<answer>` tags rewritten to match our `<think>` convention. No RTLLM v2.0 problems are used at SFT time; the warm-up is purely an out-of-distribution format-and-style prior, and all benchmark exposure happens during test-time RL. The corpus is in the verl multi-turn `messages` format with a $95/5$ train/val split.

#### Training\.

We train Qwen3-8B for $3$ epochs at learning rate $10^{-5}$ with FSDP, global batch size $64$, max sequence length $16{,}384$, on $8\times$ A800 GPUs. All RTLLM v2.0 RL experiments in this paper start from the global-step-$18$ checkpoint, which corresponds to roughly $1{,}152$ examples seen ($\sim\!0.23$ epochs over the corpus), so each training row has been seen *at most once*. This minimal warm-up is intentional (in pilot runs longer SFT degraded downstream exploration during RL), and the goal is only to lock in response format, not to teach problem-specific solutions.
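The exposure figure follows directly from the step count and the batch size. The short Python check below reproduces it; the variable names are illustrative.

```python
global_step = 18        # checkpoint used to initialize the policy
global_batch = 64       # examples per optimizer step
corpus_rows = 5_000     # sampled CoDeV-R1 rows

examples_seen = global_step * global_batch      # 18 * 64
epoch_fraction = examples_seen / corpus_rows    # fraction of one pass over the corpus
print(examples_seen, round(epoch_fraction, 2))  # prints: 1152 0.23
```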

#### Is SFT or test\-time RL doing the work?

Two factors bound the risk that SFT, rather than test-time RL, drives the headline numbers. (1) The $\sim\!0.23$-epoch budget on a corpus that contains no RTLLM v2.0 problems is too small for the model to memorize benchmark solutions: the warm-up shifts response *format*, not benchmark answers. (2) The C910 LZA experiments do not use the SFT checkpoint at all: they initialize from the raw Qwen3-8B base model ([Table 8](https://arxiv.org/html/2606.05253#A6.T8)). The *Best-of-$N$* row of [Table 3](https://arxiv.org/html/2606.05253#S4.T3) freezes that raw base policy and draws $N=3200$ samples on `ct_vfmau_lza_simd_half`; it never produces a single functionally correct design within the budget. Since SFT plays no role on C910, the $-59.4\%$ ADP improvement on this unit is attributable to test-time RL on top of the off-the-shelf backbone. The released artifact ships the SFT configuration and merge script so reviewers can re-run the warm-up under any policy of choice.

## Appendix H Prompt Templates

#### System prompt\.

The system prompt is fixed for all problems:

> You are an expert Verilog RTL designer\. When reasoning, use <think\>\.\.\.</think\> tags\. Output only Verilog code within ‘‘‘verilog fences\.

The templates below are the RTLLM v2.0 main-run prompts in which the PPA product $M=A\cdot D\cdot P$ is shown to the model. In the C910 LZA ablation (where the OpenSTA power column is not collected) the same templates degenerate to ADP ($M=A\cdot D$); we use the generic placeholder “{M}” below to make this dual usage explicit.

#### Root state user prompt\.

> \#\# Reference Implementation Below is a known working implementation that synthesizes to area={A}μm^2, delay={D}ps, power={P}μW (PPA-product={M}). Your task is to produce a functionally correct implementation with lower PPA-product (Area×Delay×Power). ‘‘‘verilog {reference\_code} ‘‘‘ \#\# Design Specification {specification} Write a complete Verilog module with PPA-product lower than {M}. Output only Verilog code within ‘‘‘verilog fences.

#### Non\-root state user prompt\.

The non\-root prompt appends the following block after the reference and specification:

> \#\# Previous Attempt ‘‘‘verilog {parent\_code} ‘‘‘ \#\# Feedback from Previous Attempt - Syntax: PASS - Functional Test: PASS - Synthesis: area={A}μm^2, delay={D}ps, power={P}μW, PPA-product={M} Previous PPA-product: {prev\_M} -> current: {M} (improved by {Δ}) You are iteratively optimizing PPA-product (lower is better). Produce a design with PPA-product lower than {M}.

## Appendix I Entropic Adaptive Beta: Implementation Detail

#### Stage\-1 syntax score\.

On `iverilog` failure the syntax reward is $r_{\text{syn}}=1/(1+n_{\text{err}})$, additionally multiplied by $0.3$ when the log contains a port-binding keyword (`port`, `unknown module`, `not a module`), and falling back to $0.5$ when the log cannot be parsed.
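A minimal sketch of this scoring rule follows. It assumes the error count is the number of log lines containing the word `error`, and that a log with no such line counts as unparseable; the actual log parser is not described here. `syntax_reward` and `PORT_KEYWORDS` are hypothetical names.

```python
PORT_KEYWORDS = ("port", "unknown module", "not a module")


def syntax_reward(log: str) -> float:
    """Score a failed iverilog run from its log text."""
    lowered = log.lower()
    # ASSUMPTION: one error per log line that mentions "error".
    n_err = sum("error" in line for line in lowered.splitlines())
    if n_err == 0:
        return 0.5                       # fallback: log could not be parsed
    reward = 1.0 / (1.0 + n_err)         # r_syn = 1 / (1 + n_err)
    if any(keyword in lowered for keyword in PORT_KEYWORDS):
        reward *= 0.3                    # port-binding penalty
    return reward
```

The reward shrinks smoothly with the number of reported errors, so a candidate with one syntax error still scores higher than one with many.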

#### Binary search for $\beta^{*}$.

The binary search for $\beta^{*}$ operates on $[10^{-6},10^{6}]$ with 64 iterations at the current step's KL budget $\delta_{t}$. TTT-Discover \[Yuksekgonul et al., [2026](https://arxiv.org/html/2606.05253#bib.bib1)\] sets $\delta_{t}\equiv\ln 2$ as a constant; we report this as the *fixed* baseline in the KL-budget ablation ([Table 3](https://arxiv.org/html/2606.05253#S4.T3)). TTT-RTL replaces the constant with the adaptive controller of [Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4); its hyperparameters are listed in [Table 9](https://arxiv.org/html/2606.05253#A9.T9). Intuitively, the constant $\ln 2$ enforces that the group's softmax distribution is no more concentrated than a Bernoulli(0.5) distribution, preventing the policy gradient from collapsing to a single rollout even when reward differences are large; the adaptive controller relaxes this hard cap and instead modulates $\delta$ between $0.25\ln 2$ and $4\ln 2$ in response to the four EMA signals described in [Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4).
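A minimal sketch of such a search is given below, under two assumptions that this appendix does not state: the constrained quantity is the KL divergence from the group's softmax weights $q_{i}\propto\exp(\beta r_{i})$ to the uniform distribution, and the bisection runs in log-space over $[10^{-6},10^{6}]$. The function names are hypothetical.

```python
import math


def softmax_weights(rewards, beta):
    """q_i proportional to exp(beta * r_i), computed with max-subtraction."""
    top = max(rewards)
    exps = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(weights):
    """KL(q || uniform) = log(n) + sum_i q_i log q_i."""
    n = len(weights)
    return math.log(n) + sum(q * math.log(q) for q in weights if q > 0.0)


def search_beta(rewards, delta, lo=1e-6, hi=1e6, iters=64):
    """Largest beta in [lo, hi] whose weights stay within the KL budget delta."""
    if kl_to_uniform(softmax_weights(rewards, hi)) <= delta:
        return hi                        # budget never binds: search saturates
    for _ in range(iters):
        mid = math.sqrt(lo * hi)         # ASSUMPTION: geometric midpoint
        if kl_to_uniform(softmax_weights(rewards, mid)) <= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

In this sketch a group with constant rewards always returns the upper bound, which is the saturation event counted by the $\beta_{\max}$-rate signal described in the next subsection.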

#### Signals and EMA update\.

At step $t$, given the batch of $G$ groups produced by $\delta_{t-1}$, the controller measures four scalar signals: (i) the average policy-vs-reference KL $\mathrm{KL}_{\mathrm{ref}}$, (ii) the effective number of distinct rollouts averaged across groups, $\mathrm{eff\text{-}n}=\frac{1}{G}\sum_{g=1}^{G}\bigl(\sum_{i}q_{g,i}^{2}\bigr)^{-1}$, (iii) the fraction of groups with degenerate (constant) reward, $\mathrm{const\text{-}frac}$, and (iv) the binary-search saturation rate $\beta_{\max}\text{-}\mathrm{rate}$ (the fraction of groups whose $\beta^{*}$ hits the search bound $10^{6}$). Each signal is smoothed by a single EMA with mixing $\alpha=0.30$:

$$\bar{x}_{t}=(1-\alpha)\,\bar{x}_{t-1}+\alpha\,x_{t},\qquad x\in\{\mathrm{KL}_{\mathrm{ref}},\,\mathrm{eff\text{-}n},\,\mathrm{const\text{-}frac},\,\beta_{\max}\text{-}\mathrm{rate}\}. \tag{8}$$

A separate slower EMA with $\alpha_{r}=0.10$ tracks the running peak reward $\bar{r}_{t}$ used by the stagnation counter $N_{\mathrm{noimp}}$, which is incremented when the batch maximum fails to exceed $\bar{r}_{t-1}$ by at least $\Delta r_{\min}=0.02$ and reset to $0$ whenever a new peak is observed.
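The signal bookkeeping can be sketched in a few lines of Python. The helper names are hypothetical, and one detail is an assumption: the slow EMA is fed the batch maximum reward, which the text does not state explicitly.

```python
ALPHA = 0.30          # mixing for the four signal EMAs
ALPHA_R = 0.10        # mixing for the slower peak-reward EMA
DELTA_R_MIN = 0.02    # minimum improvement that counts as a new peak


def eff_n(groups):
    """Mean over groups of 1 / sum_i q_{g,i}^2 (effective rollout count)."""
    return sum(1.0 / sum(q * q for q in weights) for weights in groups) / len(groups)


def ema(previous, value, alpha=ALPHA):
    """x_bar_t = (1 - alpha) * x_bar_{t-1} + alpha * x_t."""
    return (1.0 - alpha) * previous + alpha * value


def stagnation_step(r_bar_prev, n_noimp, batch_max):
    """Update the peak-reward EMA and the no-improvement counter.

    ASSUMPTION: the slow EMA is fed the batch maximum reward.
    """
    improved = batch_max >= r_bar_prev + DELTA_R_MIN
    n_noimp = 0 if improved else n_noimp + 1
    return ema(r_bar_prev, batch_max, ALPHA_R), n_noimp
```

For a group of four rollouts, `eff_n` ranges from $1$ (all weight on one rollout) to $4$ (uniform weights), which is why it serves as a concentration gauge for the rules that follow.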

#### Priority ladder\.

For the first $T_{\mathrm{warm}}=20$ steps only the KL brake (P1) is active, with $\delta_{t}\equiv\delta_{0}$ otherwise. Past warm-up, the ladder evaluates P1–P4 in order and applies the assignment of the *first* rule whose guard holds; if none holds, $\delta_{t}\leftarrow\delta_{t-1}$ (hold).

$$\text{(P1) KL brake:}\quad \overline{\mathrm{KL}_{\mathrm{ref}}}_{t}>\tau_{\mathrm{KL}}\;\Rightarrow\;\delta_{t}\leftarrow\delta_{0}, \tag{9}$$

$$\text{(P2) winner-take-all:}\quad \overline{\beta_{\max}\text{-}\mathrm{rate}}_{t}>\tau_{\beta}\;\wedge\;\overline{\mathrm{eff\text{-}n}}_{t}<\tau_{n}^{\mathrm{wta}}\;\Rightarrow\;\delta_{t}\leftarrow\max(\rho_{-}\,\delta_{t-1},\,\delta_{\min}), \tag{10}$$

$$\text{(P3) stagnation:}\quad N_{\mathrm{noimp}}\geq T_{\mathrm{stag}}\;\wedge\;\overline{\mathrm{const\text{-}frac}}_{t}<\tau_{\mathrm{const}}\;\wedge\;\overline{\mathrm{eff\text{-}n}}_{t}<\tau_{n}^{\mathrm{stag}}\;\Rightarrow\;\delta_{t}\leftarrow\max(\rho_{-}\,\delta_{t-1},\,\delta_{\min}), \tag{11}$$

$$\text{(P4) over-exploring:}\quad \overline{\mathrm{eff\text{-}n}}_{t}>\tau_{n}\;\wedge\;\overline{\mathrm{const\text{-}frac}}_{t}<\tau_{\mathrm{const}}^{\mathrm{lo}}\;\wedge\;N_{\mathrm{noimp}}<T_{\mathrm{plat}}\;\Rightarrow\;\delta_{t}\leftarrow\min(\rho_{+}\,\delta_{t-1},\,\delta_{\max}). \tag{12}$$

The *eff-n* joint gates on P2 and P3 ($\tau_{n}^{\mathrm{wta}}$, $\tau_{n}^{\mathrm{stag}}$) are essential: they prevent the controller from shrinking $\delta$ when group weights are already near uniform (a flat-reward regime that would collapse the advantages to zero on further shrinkage), and instead require that the rule that fires acts on a genuinely concentrated group.
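The ladder can be read as a first-match rule chain. In the Python sketch below only $\delta_{0}=\ln 2$, the bounds $0.25\ln 2$ and $4\ln 2$, and $T_{\mathrm{warm}}=20$ come from the text; every other threshold and both multipliers are placeholder values chosen for illustration, and Table 9 remains the authoritative source. Zero-based step counting is also an assumption, and the names `Signals`, `Config` and `next_delta` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Signals:
    kl_ref: float         # EMA of policy-vs-reference KL
    eff_n: float          # EMA of effective rollout count
    const_frac: float     # EMA of constant-reward group fraction
    beta_max_rate: float  # EMA of binary-search saturation rate
    n_noimp: int          # steps since the last new peak


@dataclass
class Config:
    delta_0: float = math.log(2)
    delta_min: float = 0.25 * math.log(2)
    delta_max: float = 4.0 * math.log(2)
    t_warm: int = 20
    # PLACEHOLDERS below: illustrative values, not the paper's settings.
    tau_kl: float = 0.5
    tau_beta: float = 0.5
    tau_n_wta: float = 1.5
    t_stag: int = 10
    tau_const: float = 0.5
    tau_n_stag: float = 2.0
    tau_n: float = 3.0
    tau_const_lo: float = 0.2
    t_plat: int = 5
    rho_minus: float = 0.8
    rho_plus: float = 1.25


def next_delta(step: int, prev: float, s: Signals, c: Config = Config()) -> float:
    """Return delta_t from delta_{t-1} using the first rule whose guard holds."""
    if s.kl_ref > c.tau_kl:                                      # P1: KL brake
        return c.delta_0
    if step < c.t_warm:                                          # warm-up: stay at delta_0
        return c.delta_0
    if s.beta_max_rate > c.tau_beta and s.eff_n < c.tau_n_wta:   # P2: winner-take-all
        return max(c.rho_minus * prev, c.delta_min)
    if (s.n_noimp >= c.t_stag and s.const_frac < c.tau_const
            and s.eff_n < c.tau_n_stag):                         # P3: stagnation
        return max(c.rho_minus * prev, c.delta_min)
    if (s.eff_n > c.tau_n and s.const_frac < c.tau_const_lo
            and s.n_noimp < c.t_plat):                           # P4: over-exploring
        return min(c.rho_plus * prev, c.delta_max)
    return prev                                                  # no guard holds: hold
```

Because the rules are checked in order, a high reference KL always resets the budget to $\delta_{0}$ before any shrink or grow rule is considered.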

Table 9: Hyperparameters of the adaptive KL-budget controller ([Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4)). All values are held fixed across all problems and PDKs reported in this paper. Symbols match the priority-ladder description in [Section 3.4](https://arxiv.org/html/2606.05253#S3.SS4).

## NeurIPS Paper Checklist

1. Claims. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: \[Yes\] Justification: The abstract and introduction state three contributions: (i) a per-design test-time training framework that closes the LLM–EDA loop ([Section 3](https://arxiv.org/html/2606.05253#S3)), (ii) external comparison on RTLLM v2.0 reporting a $65.1\%$ geomean PPA-product reduction vs. the v2.0 reference, against $26.1\%$ for the strongest published baseline ([Table 1](https://arxiv.org/html/2606.05253#S4.T1), [Section 4.2](https://arxiv.org/html/2606.05253#S4.SS2)), and (iii) an industrial C910 LZA case study with a $59.4\%$ ADP reduction and component ablations ([Table 3](https://arxiv.org/html/2606.05253#S4.T3), [Section 4.3](https://arxiv.org/html/2606.05253#S4.SS3)). The adaptive KL-budget controller is framed as a robustness/task-responsiveness observation rather than a peak-performance win, matching the multi-seed evidence in [Appendix D](https://arxiv.org/html/2606.05253#A4).
2. Limitations. Question: Does the paper discuss the limitations of the work performed by the authors? Answer: \[Yes\] Justification: (L1) Single-seed main results. RTLLM v2.0 and the C910 main-table results are single-seed (matching the published agent baselines, none of which report seed variance); a four-seed paired replication on LZA `simd_half` and a single-seed case study on `ct_vfdsu_fadd_close_s0_d` ([Appendix D](https://arxiv.org/html/2606.05253#A4)) confirm direction and a ${\sim}2.6\times$ seed-wise variance reduction but do not reach $p<0.05$ at $n=4$, so per-row gaps in [Table 3](https://arxiv.org/html/2606.05253#S4.T3) should be read as ranking evidence rather than tight effect sizes. (L2) Simulation-only correctness. Designs are validated against the RTLLM testbench, not formal equivalence; a Yosys `eqy` check is the natural next step, as is timing-accurate synthesis (OpenROAD \[Ajayi et al., [2019](https://arxiv.org/html/2606.05253#bib.bib32)\] or commercial flows) for more reliable PPA estimates. (L3) Internal ablation of the controller. We do not yet report a leave-one-rule-out study of P1–P4 within the adaptive controller, so we cannot claim which rule drives the gain. [Appendix D](https://arxiv.org/html/2606.05253#A4) additionally documents the controller's negative results: no metric reaches $p<0.05$ at $n=4$ on LZA, the fadd $\delta$ raise post-dates the breakthrough, and the $+0.67$ peak gap lies inside the development spread of $15.6$–$25.7$ best-so-far reward observed across development repeats of adaptive-$\delta$ + PUCT on the C910 unit.
3. Theory assumptions and proofs. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: \[N/A\] Justification: The paper does not prove new theorems. [Section 3](https://arxiv.org/html/2606.05253#S3) reuses the entropic policy-gradient objective and binary-search $\beta^{*}$ from TTT-Discover \[Yuksekgonul et al., [2026](https://arxiv.org/html/2606.05253#bib.bib1)\] and adds an empirical control rule ([Section 3.7](https://arxiv.org/html/2606.05253#S3.SS7)); we do not claim convergence or optimality results.
4. Experimental result reproducibility. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper? Answer: \[Yes\] Justification: [Section 4](https://arxiv.org/html/2606.05253#S4) and [Appendix F](https://arxiv.org/html/2606.05253#A6) specify the base model (RTLLM v2.0: Qwen3-8B + `ttt-rtl-sft` step 18; C910 LZA: raw Qwen3-8B with no SFT), training framework (verl), $100$-step budget, $B=4$ parents, $n\in\{4,8\}$ rollouts, PUCT $c=1.0$, pool cap $C_{\max}=500$, reward weights ([Equation 3](https://arxiv.org/html/2606.05253#S3.E3)), KL-budget controller hyperparameters ([Table 9](https://arxiv.org/html/2606.05253#A9.T9)), PDKs (Nangate 45 nm, Sky130 HD), tools (Yosys, OpenSTA, iverilog), seed ($42$), and prompt templates ([Appendix H](https://arxiv.org/html/2606.05253#A8)). The artifact (to be released on GitHub) ships per-rollout logs, all generated Verilog, synthesis scripts, liberty-file manifests, and machine-readable result tables; re-evaluating any reported ratio from the released Verilog requires no retraining.
5. Open access to data and code. Question: Does the paper provide open access to the data, code, and instructions needed to faithfully reproduce the main experimental results, as described in supplemental material? Answer: \[Yes\] Justification: All datasets and PDKs are public: RTLLM v2.0 \[Lu et al., [2024](https://arxiv.org/html/2606.05253#bib.bib28), Liu et al., [2025](https://arxiv.org/html/2606.05253#bib.bib15)\], the open-source XuanTie C910 RTL \[Chen et al., [2020](https://arxiv.org/html/2606.05253#bib.bib26)\], Nangate 45 nm and Sky130. The artifact (to be released on GitHub under Apache 2.0) contains synthesis scripts, generated Verilog, prompts, and result CSVs; the training code and SFT checkpoint will be released alongside.
6. Experimental setting/details. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: \[Yes\] Justification: Per-run hyperparameters are in [Section 4](https://arxiv.org/html/2606.05253#S4) (training config, sampling budget, baselines, synthesis flow, PDK choice), [Appendix F](https://arxiv.org/html/2606.05253#A6) and [Table 9](https://arxiv.org/html/2606.05253#A9.T9). We also disclose the hyperparameter selection protocol: TTT-Discover defaults are inherited unmodified except for the adaptive KL-budget controller, whose thresholds were fixed once on the C910 LZA pilot and held constant across all $49$ RTLLM v2.0 designs and the second C910 unit ([Section 3.7](https://arxiv.org/html/2606.05253#S3.SS7), [Appendix D](https://arxiv.org/html/2606.05253#A4)).
7. Experiment statistical significance. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: \[Yes\] Justification: The main RTLLM v2.0 and C910 main-table numbers are single-seed, matching all three published agent baselines (none of EvolVE / VeriAgent / REvolution report seed variance), and we say so explicitly in [Section 4](https://arxiv.org/html/2606.05253#S4) and [Table 3](https://arxiv.org/html/2606.05253#S4.T3) (“All rows are single-seed”). For the adaptive vs. fixed $\delta$ contrast, [Appendix D](https://arxiv.org/html/2606.05253#A4) reports a four-seed paired replication on LZA `simd_half` with mean $\pm$ std and paired $t$-test $p$-values for four metrics ([Table 7](https://arxiv.org/html/2606.05253#A4.T7)), an exhaustive enumeration over all $24$ adaptive-to-fixed pairings, leave-one-out checks, and a single-seed second-task replication on `fadd_close_s0_d`. The negative-significance results are reported alongside the positive variance-reduction result.
8. Experiments compute resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: \[Yes\] Justification: All experiments use A800 GPUs. RTLLM v2.0: each of the $49$ designs runs on $2$ nodes $\times$ $4$ A800 GPUs ($8$ GPUs total), wall-clock ${\sim}2$ h per design (${\sim}16$ GPU-h per design); the full $49$-design sweep is ${\sim}780$ GPU-h. C910 LZA ablations: each row runs on $1$ node $\times$ $8$ A800 GPUs, wall-clock ${\sim}4.5$ h per row (${\sim}36$ GPU-h per row); the $8$ ablation rows in [Table 3](https://arxiv.org/html/2606.05253#S4.T3) total ${\sim}320$ GPU-h. The seed and replication protocol is in [Section 4](https://arxiv.org/html/2606.05253#S4) (“Compute and seed protocol”).
9. Code of ethics. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics? Answer: \[Yes\] Justification: The work uses publicly released model weights (Qwen3-8B), public benchmarks (RTLLM v2.0), and the open-source XuanTie C910 RTL under their respective permissive licenses. No human subjects, no crowdsourcing, no scraped or private data are involved. The synthesized RTL is hardware-design output, not personal data.
10. Broader impacts. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: \[Yes\] Justification: TTT-RTL targets PPA reduction in fixed-function hardware modules, which can reduce energy and silicon area for deployed designs. Negative impacts are limited: the framework requires both an LLM and a full EDA flow, so it does not lower the barrier to malicious hardware generation beyond what is already achievable with standard EDA tools. We do not foresee dual-use risk that would warrant a release-time gating mechanism beyond the standard model card.
11. Safeguards. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: \[N/A\] Justification: We release fine-tuned Verilog-domain weights of an existing public base model (Qwen3-8B) plus per-design rollout logs and synthesis scripts. The artifact does not contain scraped data, pretrained generative models for natural-language or image content, or any content with elevated misuse risk relative to the base model.
12. Licenses for existing assets. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: \[Yes\] Justification: All third-party assets are cited in-text and respected in license: Qwen3-8B \[Yang et al., [2025](https://arxiv.org/html/2606.05253#bib.bib24)\] (Apache 2.0), verl \[Sheng et al., [2024](https://arxiv.org/html/2606.05253#bib.bib14)\] (Apache 2.0), Yosys \[Wolf and Glaser, [2013](https://arxiv.org/html/2606.05253#bib.bib27)\] (ISC), OpenSTA (GPLv3), iverilog (GPLv2), Nangate Open Cell Library 45 nm (Apache-style), Sky130 PDK (Apache 2.0), RTLLM v2.0 \[Lu et al., [2024](https://arxiv.org/html/2606.05253#bib.bib28), Liu et al., [2025](https://arxiv.org/html/2606.05253#bib.bib15)\] (MIT), the XuanTie C910 RTL \[Chen et al., [2020](https://arxiv.org/html/2606.05253#bib.bib26)\] (Apache 2.0), and baseline numbers from Ping et al. \[[2026](https://arxiv.org/html/2606.05253#bib.bib23)\] (released under MIT). Per-asset versions are listed in [Section 4](https://arxiv.org/html/2606.05253#S4) and [Appendix F](https://arxiv.org/html/2606.05253#A6).
13. New assets. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: \[Yes\] Justification: New assets are: (a) the TTT-RTL training code, (b) the `ttt-rtl-sft` SFT checkpoint, (c) generated Verilog for all $49\times 4=196$ method$\times$design rollout pools, and (d) the `reference_ppa_measured.csv` flow-sanity table ([Appendix B](https://arxiv.org/html/2606.05253#A2)). The artifact contains a top-level README with the directory layout, run instructions, license (Apache 2.0), and per-design provenance metadata. [Appendix F](https://arxiv.org/html/2606.05253#A6) doubles as a model card for the SFT checkpoint.
14. Crowdsourcing and research with human subjects. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: \[N/A\] Justification: No crowdsourcing or human-subjects studies were conducted. All evaluation signals come from automated EDA tools (Yosys, OpenSTA, iverilog) on public Verilog inputs.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: \[N/A\] Justification: No human-subjects research was performed; IRB review does not apply.
16. Declaration of LLM usage. Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core method development in this research? Answer: \[Yes\] Justification: An LLM is the central object of study: the policy that TTT-RTL fine-tunes per design. The base model (Qwen3-8B), the SFT corpus (`ttt-rtl-sft`, step 18), the training framework (verl), the prompt templates ([Appendix H](https://arxiv.org/html/2606.05253#A8)), and all training-time hyperparameters ([Tables 8](https://arxiv.org/html/2606.05253#A6.T8) and [9](https://arxiv.org/html/2606.05253#A9.T9)) are documented in [Section 4](https://arxiv.org/html/2606.05253#S4) and the appendix. LLMs were not used as a tool for paper writing or experiment design beyond standard authorial assistance, and any such use does not affect the reported methodology, results, or originality.