@dair_ai: Cool new paper from NVIDIA. Looks like agentic coding is moving into hardware design. HORIZON treats hardware design as…

X AI KOLs Following 06/30/26, 12:11 AM Papers

agentic hardware-design nvidia code-evolution horizon repository-level llm

Summary

NVIDIA proposes HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution, achieving 100% benchmark completion across several hardware design suites.

Cool new paper from NVIDIA. Looks like agentic coding is moving into hardware design. HORIZON treats hardware design as repository-level code evolution. A Markdown harness becomes a project pack with domain knowledge, an executable evaluator, an acceptance predicate, and a git policy. The agent then evolves an isolated worktree. That is a strong pattern because hardware design needs executable checks. The verifier harness becomes the real interface between the agent and the design task. The paper reports 100% benchmark completion across several hardware design suites, which makes this one worth tracking even if you do not work on EDA. Paper: https://arxiv.org/abs/2606.28279 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article

Cached at: 06/30/26, 03:36 AM

Agentic Hardware Design as Repository-Level Code Evolution

Source: https://arxiv.org/html/2606.28279

Authors: Cunxi Yu, Chenhui Deng, Nathaniel Pinckney, and Brucek Khailany (all NVIDIA Research)

Abstract

We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hardware-design artifacts themselves. We evaluate our approach on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, achieving 100% benchmark completion across all suites with a fully hands-free agentic loop. However, we do not claim that agentic AI for hardware design is solved: these benchmarks are controlled proxies for a much broader engineering problem in chip design. Section 5 examines the limitations of the current study and highlights open research challenges.

1 Introduction

Executable design tasks expose a limitation of single-turn code generation. A useful agent must place candidate artifacts inside a runnable workspace, invoke domain tools, interpret failures, and revise the artifacts until an explicit acceptance condition is satisfied. RTL design is a sharp test case: correctness depends on cycle-level behavior, reset and interface conventions, bit widths, and simulator feedback, so plausible Verilog is not enough.

Our motivation comes from repository-scale code self-evolution. AlphaEvolve showed that LLMs with automated evaluators can improve algorithmic kernels (Novikov et al., 2025); SATLUTION scaled the idea to full SAT-solver repositories (Yu et al., 2025); and ABCEvo applied it to the ABC logic-synthesis system (Yu et al., 2026). In all cases, the agent evolves a version-controlled software artifact and admits changes only when executable evidence supports them. The missing step is hardware: prior repository-level self-evolution changes the programs that engineers run, not the hardware designs engineers create.

Here, we ask “whether hardware design itself can be managed as repository-level code evolution”. Inspired by prior work (Yu et al., 2026, 2025), HORIZON turns a design problem into a self-contained git worktree with an executable acceptance gate. A structured Markdown harness specifies the objective, domain knowledge, evaluator, and acceptance predicate; a bootstrap agent compiles it into a project pack; and a hands-free agent loop edits, evaluates, commits, or rejects candidate versions. Git is not incidental bookkeeping in this design. It provides the isolated evolving environment and the trace substrate: diffs expose state changes, commits define accepted checkpoints, logs and notes store evaluator evidence, and the repository history becomes a replayable record of the agent’s search.

This framing is broader than the benchmarks in this paper. Agentic AI for hardware design includes architecture exploration, microarchitecture, verification planning, physical-design interaction, EDA software, and methodology development; we do not claim that RTL agents are solved. We use RTL benchmarks as controlled, executable proxies for studying whether repository-managed feedback can drive convergence. The evaluation spans ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, including completion, modification, reuse, testbench stimulus, checker and assertion generation, and debugging (Pinckney et al., 2025).

This paper makes three contributions. First, we introduce HORIZON, a framework that hosts hardware design tasks as isolated, version-controlled, automatically evaluated repositories rather than as one-shot prompts. Second, we show that this git-native self-evolution loop can sweep complete RTL benchmark suites to a 100% pass rate, with the only residual failure traced to a known specification-harness mismatch. Third, we analyze the resulting traces, including token consumption and test-generation coverage, and show that once executable feedback makes correctness converge, the main research bottleneck becomes convergence efficiency and verification quality.

2 Background and Related Work

Why RTL is hard for language models.

RTL generation differs from ordinary code completion because the output defines hardware that must satisfy temporal and bit-accurate behavior. A model must infer datapath widths, finite-state-machine transitions, reset conventions, ready-valid protocols, memory behavior, and corner cases that are often underspecified in natural language. A syntactically valid module is only a starting point; useful automation must connect generation to compilation, simulation, waveform or trace inspection, and repair.

RTL-specialized models and data.

One body of work improves the generator itself. Early studies fine-tuned open models on Verilog corpora and established that domain adaptation matters: VeriGen curated large HDL training data and benchmarked open and closed models (Thakur et al., 2024), RTLCoder released an open dataset and a lightweight model that surpassed GPT-3.5 on RTL generation (Liu et al., 2025), and ChipNeMo domain-adapted LLMs across chip-design tasks (Liu et al., 2023). More recent efforts target data quality and reasoning: OriGen uses code-to-code augmentation with self-reflection (Cui et al., 2024), CraftRTL constructs correct-by-construction synthetic data including non-textual representations (Liu et al., 2024), and ScaleRTL scales RTL reasoning data and adds test-time reasoning, improving Verilog-Eval and RTLLM (Deng et al., 2025). These approaches strengthen first-attempt accuracy but do not, by themselves, define how an agent should iterate against an executable harness.

Iterative and agentic RTL.

A complementary body of work adds tool use and iteration on top of a generator. AutoChip drives a generate-compile-simulate feedback loop (Thakur et al., 2023); RTLFixer repairs syntax errors with retrieval-augmented, ReAct-style debugging (Tsai et al., 2024); VerilogCoder plans with a task-and-circuit-relation graph and traces waveforms via an AST-based tool to localize functional bugs (Ho et al., 2025); MAGE decomposes a design across cooperating agents with high-temperature sampling and checkpoint-based debugging (Zhao et al., 2025); and ACE-RTL pairs an RTL-specialized generator with a frontier-model reflector and coordinator that evolves the prompting context over repair steps (Deng et al., 2026). These systems show that verification feedback substantially improves correctness, but each is built around a generation-and-repair pipeline for individual modules. HORIZON is complementary and more general: it is agnostic to the underlying generator and instead specifies how to host the entire problem as a versioned repository, organize and gate the repair loop with native git operations, and drive a whole benchmark suite to completion, so any backbone can be evaluated on convergence rather than only on first-attempt accuracy.

Benchmarks for RTL design and verification.

Verilog-Eval and RTLLM are widely used RTL generation benchmarks and remain useful for measuring basic specification-to-RTL ability (Liu et al., 2023; Lu et al., 2024); ChipBench and consolidated suites such as OpenLLM-RTL aggregate further generation problems (Liu et al., 2025). However, the CVDP paper argues that earlier suites are increasingly saturated and too narrow to represent real hardware design workflows (Pinckney et al., 2025). CVDP contains 783 human-authored problems across 13 task categories. Its code-generation side includes RTL code completion, natural-language-specification to RTL, code modification, module reuse, linting or quality improvement, testbench stimulus generation, checker generation, assertion generation, and debugging. Its comprehension side includes RTL/specification correspondence, testbench/test-plan correspondence, and technical question answering.

CVDP also distinguishes non-agentic and agentic settings. Non-agentic problems provide the prompt and relevant context in a single turn, while agentic problems are packaged as mini-repositories intended for Dockerized agents that can inspect files and invoke tools. The benchmark reports 617 non-agentic and 166 agentic problems after quality filtering. The authors emphasize that state-of-the-art single-shot models struggle particularly on verification-oriented tasks such as testbench generation, checker generation, assertion generation, and bug fixing. This makes CVDP a strong fit for evaluating whether an agent can improve beyond first-attempt model accuracy through execution feedback.

Self-evolving agents over code repositories.

A separate line of work treats the codebase itself as the object that an agent evolves. AlphaEvolve coupled an LLM with automated evaluators and an evolutionary loop to discover and refine algorithms at the scale of isolated kernels (Novikov et al., 2025). SATLUTION extended this to the full repository scale, evolving entire C/C++ SAT-solver repositories under strict correctness guarantees and distributed runtime feedback while also self-evolving its own evolution rules, ultimately outperforming the human-designed SAT-competition winners (Yu et al., 2025). ABCEvo carried the idea into EDA, using coordinated LLM agents to autonomously rewrite the million-line ABC logic-synthesis system under a correctness- and QoR-driven evaluation loop (Yu et al., 2026). All three evolve EDA or scientific software, the programs that engineers run, under the shared principle that a candidate change is useful only when it survives executable correctness checks and improves measured behavior. HORIZON carries the same repository-self-contained, evidence-gated principle to the hardware side: rather than evolving a solver or synthesis kernel, it evolves the design under test (RTL sources, testbenches, and verification artifacts) as an automatically constructed task over a git worktree. This lets one protocol cover per-task RTL generation, completion, and repair, and is what allows HORIZON to treat an RTL benchmark problem “as is” rather than reformulating it into a software-engineering surrogate.

Git as an agent substrate.

Closely related to our implementation is a recent trend of using version control itself as the scaffolding for LLM agents. EvoGit coordinates a population of decentralized coding agents purely through a Git-based phylogenetic graph that records the full version lineage, with no shared memory or explicit message passing (Huang et al., 2025). Git Context Controller elevates an agent’s context from a transient token stream to a persistent, version-controlled memory with explicit commit, branch, and merge operations for long-horizon software tasks (Wu et al., 2025). Both demonstrate that git semantics (branching, lineage, and checkpointing) are a natural fit for organizing agentic exploration, but both target software engineering: collaborative code generation and context management, respectively. HORIZON shares the conviction that git is the right substrate, and likewise records every attempt as a commit with attached evaluator evidence, but applies it to a different end: hosting an individual hardware-design problem as a self-contained, verification-gated repository whose history doubles as a replayable experience trace. We view this as convergent evidence that repository-native agents are an emerging paradigm rather than a single-domain trick.

3 The HORIZON Framework

System overview.

Figure 1 summarizes HORIZON. The central idea is to manage a hardware-design problem like a software-evolution problem: the design, context, harness, and evaluator with correctness gate live in an isolated git worktree, and progress is expressed as repository state changes rather than as disconnected chat turns. A user provides a structured Markdown harness; a bootstrap agent compiles it into a project pack, the control plane that fixes the mission, domain skills, executable evaluator, correctness gate, and git/runtime policy. From then on, a self-contained agent loop evolves the worktree without further human input. Each cycle generates or edits candidate artifacts, runs the evaluator, scores the result, and either commits the new version or logs the failure. Native git functions provide both isolation and traceability: diffs expose proposed state changes, commits define accepted checkpoints, logs recover the trajectory, and notes/runtime summaries attach evaluator evidence. The same machinery is intended to host versatile chip-design work, including RTL, EDA-software research, and methodology or flow exploration. In this paper we exercise the RTL instantiation; the broader design-space claim is a framework goal rather than a completed empirical validation.

From harness to executable task.

HORIZON views language-model agents as policies acting on executable workspaces. The framework is not specific to RTL or to EDA self-evolution: any task with a persistent git worktree, machine-checkable feedback, and versioned artifacts can be organized in the same way. The only required user input is a structured Markdown harness, which may contain high-level intent, repository context, expected artifacts, evaluation criteria, and domain knowledge. Domain-aware harnesses are especially useful because they expose invariants, tool conventions, and failure modes that are difficult to infer from files alone.
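To make the harness contents concrete, the following is a minimal sketch of how such a file might be split into named sections. The section titles, the sample harness text, and the function `parse_harness` are all hypothetical: the text above lists the kinds of content a harness may carry but does not give a schema.

```python
def parse_harness(markdown_text: str) -> dict:
    """Split a Markdown harness into {section title: body} using '## ' headings.

    Hypothetical helper for illustration only; not the paper's bootstrap agent.
    """
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # A second-level heading opens a new section.
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Assumed example harness; every field below is made up for the sketch.
EXAMPLE_HARNESS = """\
# Task: 8-bit counter

## Intent
Implement a synchronous 8-bit up-counter with active-high reset.

## Expected artifacts
rtl/counter.v

## Evaluation criteria
The provided testbench must report zero mismatches.

## Domain knowledge
Reset is synchronous; the counter wraps from 255 to 0.
"""

sections = parse_harness(EXAMPLE_HARNESS)
print(sorted(sections))
```

In this sketch, lines before the first `## ` heading (the title) are ignored, and each section body is kept as raw text for a later bootstrap step to interpret.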

The bootstrap agent converts this harness into a project pack. Let $m$ denote the input harness. A bootstrap tool loop $G_{\phi}$ constructs

$$p=G_{\phi}(m),\qquad p=(\pi_{\mathrm{agent}},E_{p},A_{p},\Gamma_{p},\Omega_{p}), \tag{1}$$

where $\pi_{\mathrm{agent}}$ is the agent policy prompt and tool contract, $E_{p}$ is an executable evaluator or harness, $A_{p}$ is the acceptance predicate, $\Gamma_{p}$ is the version-control and artifact policy, and $\Omega_{p}$ contains domain skills and repository instructions. For RTL, $E_{p}$ may include compilation, simulation, coverage extraction, and assertion or testbench checks. In other domains, the same slot may be filled by unit tests, theorem provers, profilers, security scanners, synthesis tools, or human-review gates. Problems are therefore defined over git worktrees rather than over a fixed target repository type.
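One way to picture the five-slot tuple in Equation (1) is as a plain record type. The sketch below is an assumption-laden illustration: the class name `ProjectPack`, its field names, the toy evaluator, and the policy keys are invented here and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Worktree = Dict[str, str]    # file path -> file contents (toy stand-in for a git worktree)
Evidence = Dict[str, float]  # evaluator output, e.g. {"pass": 1.0}


@dataclass
class ProjectPack:
    """Hypothetical container mirroring p = (pi_agent, E_p, A_p, Gamma_p, Omega_p)."""
    agent_prompt: str                                          # pi_agent
    evaluator: Callable[[Worktree], Evidence]                  # E_p
    accept: Callable[[Evidence], bool]                         # A_p
    git_policy: Dict[str, str] = field(default_factory=dict)   # Gamma_p
    domain_skills: List[str] = field(default_factory=list)     # Omega_p


def toy_evaluator(files: Worktree) -> Evidence:
    # Stand-in for compile + simulate: "passes" if the expected module is declared.
    source = files.get("rtl/counter.v", "")
    return {"pass": 1.0 if "module counter" in source else 0.0}


pack = ProjectPack(
    agent_prompt="Edit rtl/counter.v until the evaluator passes.",
    evaluator=toy_evaluator,
    accept=lambda evidence: evidence.get("pass", 0.0) == 1.0,
    git_policy={"branch_prefix": "horizon/"},   # assumed policy key
    domain_skills=["Reset is synchronous."],
)

evidence = pack.evaluator({"rtl/counter.v": "module counter(); endmodule"})
print(pack.accept(evidence))
```

Keeping the evaluator and the acceptance predicate as separate callables reflects the separation in the text: the evaluator only produces evidence, and the predicate alone decides whether that evidence is sufficient.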

Figure 1: Overview of HORIZON. A human-defined Markdown harness is converted into a project pack that specifies the mission, domain skills, evaluator, correctness gate, and git/runtime policy. The resulting task is solved by a self-contained agent loop over git-traced repository states. The executable task instance provides evaluator feedback and reward, while the depth-$k$ trace buffer records version history, state deltas, evaluator outcomes, and runtime summaries.

Repository-traced formulation.

HORIZON is an agentic system: the underlying policy is a free-form, history-dependent LLM agent, and we make no claim that its behavior is Markovian. We borrow the vocabulary of a semi-Markov decision process for one narrow purpose, to give precise, replayable names to the objects we record, not as a behavioral or optimization assumption. Because each accepted version follows a temporally extended episode of many edits, tool calls, and partial repairs, it is natural to label the boundaries: a “state” is a versioned snapshot of the repository (a bookkeeping checkpoint, not a sufficient statistic of the agent’s reasoning), and an “option” is one such episode between two checkpoints. With that caveat, the objects below are simply definitions of what each trace stores. At outer checkpoint $t$, the recorded state is

$$s_{t}=\big(\mathrm{tree}(w_{t}),\,p,\,z_{t},\,\ell_{\leq t},\,\mu_{t}\big), \tag{2}$$

where $\mathrm{tree}(w_{t})$ is the git tree of the current worktree, $p$ is the project pack, $z_{t}$ is campaign state, $\ell_{\leq t}$ are accumulated logs and evaluator artifacts, and $\mu_{t}$ is any declared memory that the policy is allowed to condition on. The agent samples a variable-length option

$$a_{t}=(\Delta_{t},\,u_{t,1:K_{t}},\,\rho_{t}), \tag{3}$$

where $\Delta_{t}$ is the proposed patch or generated artifact set, $u_{t,1:K_{t}}$ are the $K_{t}$ tool calls and observations inside the option, and $\rho_{t}$ is the final review or submission decision. The evaluator produces evidence

$$y_{t}=E_{p}(w_{t}\oplus\Delta_{t}), \tag{4}$$

and the acceptance predicate determines whether the trace advances:

$$s_{t+1}=\begin{cases}\mathrm{Commit}(w_{t}\oplus\Delta_{t},\,y_{t},\,\Gamma_{p}),&A_{p}(y_{t})=1,\\[2pt] \mathrm{RejectLog}(s_{t},\,\Delta_{t},\,y_{t}),&A_{p}(y_{t})=0.\end{cases} \tag{5}$$

The reward can be scalar or vector-valued, for example,

$$r_{t}=R_{p}(y_{t})=\big[\Delta\mathrm{pass},\,\Delta\mathrm{coverage},\,\Delta\mathrm{QoR},\,-\mathrm{tokens},\,-\mathrm{time}\big], \tag{6}$$

and an individual coordinate is populated only when the evaluator emits the corresponding signal; in this paper we report the $\Delta\mathrm{pass}$, $\Delta\mathrm{coverage}$, and $-\mathrm{tokens}$ components and leave synthesis quality-of-results ($\Delta\mathrm{QoR}$) to future work. An execution trace of depth $D$ is then

$$\tau_{0:D}=\{(s_{t},a_{t},r_{t},s_{t+1},y_{t})\}_{t=0}^{D-1}. \tag{7}$$

The depth $D$ is not fixed by the benchmark; it is determined by the campaign budget, convergence, or stopping rule. This makes the trace suitable for policy analysis, reward modeling, curriculum construction, or offline agent-RL training. We stress that we use this formulation only to structure and record the search; we do not train or update an RL policy in this work, and our agent backbone is fixed throughout a campaign.
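The evaluate-then-gate transition in Equations (4) and (5) can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the worktree is a dictionary rather than a git tree, the patch operator is a dictionary merge, and the names `Trace`, `apply_patch`, and `step` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Worktree = Dict[str, str]
Evidence = Dict[str, float]


@dataclass
class Trace:
    """Toy stand-in for the git-backed trace: accepted and rejected attempts."""
    commits: List[Tuple[Worktree, Evidence]] = field(default_factory=list)
    rejections: List[Tuple[Worktree, Evidence]] = field(default_factory=list)


def apply_patch(worktree: Worktree, patch: Worktree) -> Worktree:
    # Assumed meaning of "w_t (+) Delta_t": overwrite or add the patched files.
    merged = dict(worktree)
    merged.update(patch)
    return merged


def step(worktree: Worktree, patch: Worktree,
         evaluator: Callable[[Worktree], Evidence],
         accept: Callable[[Evidence], bool],
         trace: Trace) -> Worktree:
    candidate = apply_patch(worktree, patch)
    evidence = evaluator(candidate)                  # y_t = E_p(w_t (+) Delta_t)
    if accept(evidence):                             # A_p(y_t) = 1: commit
        trace.commits.append((candidate, evidence))
        return candidate
    trace.rejections.append((patch, evidence))       # A_p(y_t) = 0: reject and log
    return worktree                                  # worktree state is unchanged


def evaluator(files: Worktree) -> Evidence:
    # Toy check standing in for compilation and simulation.
    return {"pass": 1.0 if "endmodule" in files.get("dut.v", "") else 0.0}


def accept(evidence: Evidence) -> bool:
    return evidence["pass"] == 1.0


trace = Trace()
state: Worktree = {}
state = step(state, {"dut.v": "module dut();"}, evaluator, accept, trace)
state = step(state, {"dut.v": "module dut(); endmodule"}, evaluator, accept, trace)
print(len(trace.commits), len(trace.rejections))
```

The rejected first attempt leaves the worktree untouched but still lands in the trace, which matches the idea that failures are recorded as evidence rather than discarded.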

Agent loop and trace buffer.

Once the user supplies the initial Markdown harness, the loop is completely hands-free: bootstrap, generation, evaluation, acceptance, logging, and the next iteration all run automatically, and a campaign proceeds for many iterations with no further human intervention. Each outer transition contains an internal trajectory of depth $K_{t}$ (the agent reads the current state, plans a target, edits the worktree, invokes tools, interprets failures, and either repairs or submits); this inner trajectory is not assumed to be Markov and can differ in length at every step. We deliberately build the trace buffer on top of native git so that tracing is dynamic and essentially free to maintain: staged edits are inspected with git diff --cached, each accepted attempt becomes a git commit whose message and attached git notes carry the evaluator verdict and reward, the full version history is recovered with git log, and an independent review step diffs the candidate before it is allowed to submit. Successful commits become positive examples of repair strategies while rejected attempts are logged as negative examples of edits or tool-use paths that failed the evaluator, so the repository’s own history is the experience buffer rather than a separate datastore.
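The git operations named above can be laid out as an ordered command sequence for one accepted attempt. The sketch below only builds the argument lists; the commit-message and note formats, the extra `git add --all` step, the `--stat` and `--oneline` flags, and both function names are assumptions for illustration, not the paper's actual conventions.

```python
import subprocess
from typing import List


def accepted_attempt_commands(verdict: str, reward: float) -> List[List[str]]:
    """Build the git invocations for one accepted attempt (formats are assumed)."""
    return [
        ["git", "add", "--all"],                       # stage the candidate edits
        ["git", "diff", "--cached", "--stat"],         # inspect what is staged
        ["git", "commit", "-m", f"accept: {verdict}"], # accepted checkpoint
        ["git", "notes", "add", "-m",
         f"verdict={verdict} reward={reward}", "HEAD"],  # attach evaluator evidence
        ["git", "log", "--oneline"],                   # recover the version history
    ]


def run_in_worktree(commands: List[List[str]], worktree_path: str) -> None:
    """Execute the commands inside an existing git worktree (not called here)."""
    for argv in commands:
        subprocess.run(argv, cwd=worktree_path, check=True)


commands = accepted_attempt_commands("pass", 1.0)
for argv in commands:
    print(" ".join(argv))
```

Separating command construction from execution keeps the sequence easy to log and replay; calling `run_in_worktree` requires git to be installed and the path to be an initialized worktree with staged changes.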

Memory and session reuse.

Because there is no true Markov state (the process is just a sequence of agent actions and LLM responses), memory is handled pragmatically rather than as a state variable, and the operative objective is to maximize the share of cached tokens relative to freshly billed input and output tokens. HORIZON reuses a persistent model session across iterations so that the harness, project pack, stable sources, and accumulated debugging context are served from the provider’s prompt cache instead of being re-sent every turn; the newly billed tokens are then dominated by the current diff, the latest evaluator output, and the agent’s response. This keeps the marginal cost of an additional repair iteration low even when a campaign runs for dozens of iterations, and it is the main reason cumulative token usage is overwhelmingly cached input (Section 4.2). The agent may still condition on session memory $h_{t}$,

$$a_{t}\sim\pi_{\theta}(\cdot\mid s_{t},h_{t}),\qquad h_{t+1}=M(h_{t},s_{t},a_{t},y_{t}), \tag{8}$$

but the source of truth remains the git worktree, the project pack, the evaluator outputs, and the versioned trace; campaign and review memories are kept separate so that review remains an independent check.
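A toy accounting model shows why session reuse pushes the cached share up as a campaign lengthens. Everything here is an assumption made for illustration: the function name, the token counts, and the simplification that all previously seen context is served from cache on every later iteration. It is not a model of any provider's actual billing.

```python
def cached_share(prefix_tokens: int, fresh_per_iter: int, iterations: int) -> float:
    """Fraction of cumulative tokens that are cached under a toy reuse model.

    Assumptions (hypothetical): the stable prefix is billed fresh once, each
    iteration re-reads all earlier context from cache, and each iteration adds
    a fixed number of freshly billed tokens (diff, evaluator output, response).
    """
    cached = 0
    fresh = prefix_tokens          # first call pays for the prefix in full
    context = prefix_tokens
    for _ in range(iterations):
        cached += context          # everything seen so far comes from cache
        fresh += fresh_per_iter    # only the new material is billed fresh
        context += fresh_per_iter  # and it joins the cached context afterwards
    return cached / (cached + fresh)


for n in (1, 10, 50):
    print(n, round(cached_share(1000, 100, n), 3))
```

Under these assumptions the cached share grows with the number of iterations, because the reused context accumulates faster than the fresh tokens added per step.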

4 Experiments

Setup and protocol.

Model: we use GPT-5.3 as the agent backbone for all experiments, fixed throughout. Benchmarks: ChipBench, RTLLM-2.0, and Verilog-Eval, together with all CVDP code- and verification-generation categories (CID 002 to 016) spanning completion, specification-to-RTL, modification, reuse, linting/QoR, and stimulus, checker, and assertion generation as well as debugging. Host environment: all campaigns run on an AMD EPYC 9334 32-Core processor with 512 GB of RAM, with evaluators invoking each suite’s native open-source and, where required, commercial-EDA flows. Task construction: for each problem the bootstrap agent builds a project pack whose evaluator wraps the suite’s native harness (compilation, simulation, and where available coverage or assertion checks), with the acceptance predicate set to the harness pass condition. Protocol: an iteration is one automated outer step in which the agent edits the worktree, runs the evaluator, and either commits a passing version or logs a rejection; we report the best-so-far pass rate (the fraction of tasks for which a passing version has been committed by a given iteration) and define the earliest-best iteration as the first iteration that attains a benchmark’s maximum observed pass rate. The entire loop is hands-free, and all results presented in this paper are obtained in single-agent mode.
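The two protocol metrics defined above are easy to state in code. The sketch below assumes a hypothetical log format (one set of newly passing task names per iteration, indexed from 0) and made-up task names; only the metric definitions come from the text.

```python
from typing import List, Set


def best_so_far_pass_rates(passing_by_iteration: List[Set[str]],
                           total_tasks: int) -> List[float]:
    """Percent of tasks with a committed passing version by each iteration."""
    committed: Set[str] = set()
    rates = []
    for passing in passing_by_iteration:
        committed |= passing   # a task stays solved once a passing version is committed
        rates.append(100.0 * len(committed) / total_tasks)
    return rates


def earliest_best_iteration(rates: List[float]) -> int:
    """First iteration index that attains the maximum observed pass rate."""
    return rates.index(max(rates))


# Hypothetical campaign over four tasks.
log = [{"a"}, {"b", "c"}, {"a"}, {"d"}, set()]
rates = best_so_far_pass_rates(log, total_tasks=4)
print(rates)
print(earliest_best_iteration(rates))
```

Because the committed set only grows, the best-so-far curve is non-decreasing by construction, which is why later iterations with no new passes appear as plateaus.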

| Suite/category | Evaluation focus | EDA | Iter. 0 [b] | Final iter. | HORIZON |
|---|---|---|---|---|---|
| ChipBench | Mixed RTL generation tasks | Open | 20.0 | 5 | 100.0 [a] |
| RTLLM-2.0 | Natural-language spec to RTL | Open | 78.0 | 2 | 100.0 |
| Verilog-Eval v2 | HDLBits-style Verilog generation | Open | 86.2 | 2 | 100.0 |
| CVDP CID 002 | RTL code completion | Open | 3.2 | 82 | 100.0 |
| CVDP CID 003 | Natural-language spec to RTL | Open | 19.2 | 24 | 100.0 |
| CVDP CID 004 | RTL code modification | Open | 10.9 | 36 | 100.0 |
| CVDP CID 005 | Spec-to-RTL module reuse | Open | 9.1 | 14 | 100.0 |
| CVDP CID 007 | Linting / QoR improvement | Open | 0.0 | 24 | 100.0 |
| CVDP CID 012 | Test-plan to stimulus generation | Comm. | 47.8 | 32 | 100.0 |
| CVDP CID 013 | Test-plan to checker generation | Comm. | 3.8 | 19 | 100.0 |
| CVDP CID 014 | Test-plan to assertion generation | Comm. | 79.1 | 1 | 100.0 |
| CVDP CID 016 | Debugging and bug fixing | Open | 25.7 | 13 | 100.0 |
| Overall | All evaluated RTL benchmarks | | 47.8 | | 100.0 |

Table 1: Pass rates (%) from a single HORIZON run. Final iter. denotes the iteration at which HORIZON converges. EDA indicates the evaluation backend, open-source (Open) or commercial (Comm.); only CID 012–014 require a commercial simulator. HORIZON achieves 100% completion on every suite. [a] One non-passing ChipBench task is due to a specification–harness defect in the original benchmark; counting it as resolved yields 100%. [b] Iter. 0 is the pass rate after the first agent iteration, not the standalone LLM Pass@1.

Table 1 starts from the benchmark surface rather than only the final score. The evaluated tasks span compact RTL generation suites, legacy specification-to-RTL benchmarks, and nine CVDP categories that exercise completion, specification implementation, modification, module reuse, code improvement, testbench stimulus generation, checker generation, assertion generation, and debugging.

4.1 Benchmark completion and pass-rate progression

Run as a single hands-free agentic loop per benchmark set, HORIZON drives every benchmark suite to a 100% pass rate (Table 1); the only residual miss is a single ChipBench task, which we trace to a specification–harness mismatch in the original benchmark rather than to agent failure. What varies across suites is therefore not the destination but the path to it. At the agent’s first iteration, the aggregate pass rate is 47.8%, and it is substantially lower on the hardest CVDP categories, including 3.2% on code completion (CID 002) and 3.8% on checker generation (CID 013), before the same loop eventually closes the gap to 100%. We emphasize that the iteration-0 results are not standalone model Pass@1 measurements. Instead, they reflect the state of the repository after the first step of the agentic evolution process, executed under the same prompting strategy and workflow used throughout the run. As a result, the agent may defer substantial exploration, debugging, and repair to later iterations rather than attempting to maximize first-pass success. This first-iteration aggregate is buoyed by Verilog-Eval-v2, which already reaches 86.2% at iteration 0, whereas the CVDP subset starts at 23.9%. We therefore report not merely that the suites are completed, but also how the agentic loop reaches completion.

Figure 2: Best-so-far pass-rate trajectories over agent iterations. (a) Simple RTL generation suites. (b) CVDP categories. Earlier RTL generation suites saturate within a few iterations, while CVDP exposes longer repair trajectories and clearer differences in convergence difficulty.

Figure 2 shows the qualitative difference between the benchmark families. RTLLM-2.0 and Verilog-Eval reach 100% within two iterations; ChipBench climbs from 20.0% to 100% over five iterations, where the single task not passed under the original harness is the benchmark defect noted in Table 1. CVDP categories require more varied repair budgets. CID 014 reaches 100% after one iteration, CID 016 and CID 005 reach 100% in 13 and 14 iterations, CID 013 requires 19 iterations, CID 003 and CID 007 require 24 iterations, CID 012 requires 32 iterations, CID 004 requires 36 iterations, and CID 002 requires 82 iterations. The long tail in CID 002 is especially informative: it is not a one-shot modeling failure, but a convergence problem where the agent gradually converts many failing completions into passing designs.

The two extremes of difficulty are also the two most informative trajectories. CID 013 (adding reference-model checker logic to a testbench, evaluated under commercial-EDA simulation) rounds out the verification-generation family alongside stimulus generation (CID 012) and assertion generation (CID 014), and has the lowest first-iteration pass rate of the three, 3.8%, consistent with the CVDP finding that checker writing is especially hard for single-shot models. Yet despite this weak start it reaches 100% by iteration 19 along a strikingly steady, near-linear trajectory, climbing at a near-constant rate with almost no plateau. CID 013 and CID 002 thus bracket the behavior of the loop: a very low first-iteration rate does not by itself imply slow or unstable convergence (CID 013), while a long tail on a few stubborn designs is what actually drives cost (CID 002).

4.2 Token Consumption Result

We next report how much agent work each completion requires, measured as the cumulative tokens consumed through a run’s earliest-best iteration. This is not a normalized economic cost (model pricing, parallelism, and infrastructure are excluded), but it is a useful first-order measure of effort, and (per the session-reuse design) it is overwhelmingly cached input rather than freshly billed tokens.

| Suite/category | Iter. | Tokens (M) | Share |
|---|---|---|---|
| ChipBench | 5 | 2.8 | 1.3% |
| RTLLM-2.0 | 2 | 1.3 | 0.6% |
| Verilog-Eval v2 | 2 | 2.0 | 1.0% |
| CVDP CID 002 | 82 | 56.0 | 26.7% |
| CVDP CID 003 | 24 | 38.0 | 18.1% |
| CVDP CID 004 | 36 | 23.7 | 11.3% |
| CVDP CID 005 | 14 | 9.1 | 4.4% |
| CVDP CID 007 | 24 | 21.6 | 10.3% |
| CVDP CID 012 | 32 | 32.2 | 15.3% |
| CVDP CID 013 | 19 | 14.2 | 6.7% |
| CVDP CID 014 | 1 | 0.3 | 0.1% |
| CVDP CID 016 | 13 | 8.8 | 4.2% |
| Total | | 209.9 | 100.0% |

Table 2: Token consumption through the earliest-best iteration, as cumulative tokens recorded in the agent launch logs. Left: tokens (millions) and share of the total per benchmark, with the convergence iteration. Right: the same shares as a donut, with the three legacy suites grouped. Cost is dominated by a few hard CVDP categories. Note that approximately 91% of all tokens are cached input, which significantly lowered the API cost. Shares may not sum to 100.0% due to rounding.

Figure 3: Cumulative token-usage trajectories, truncated at each suite’s convergence point (no spending beyond the best iteration is shown). (a) Normalized cumulative token usage. (b) Absolute cumulative token usage. The normalized view reports cumulative tokens as a percentage of the tokens consumed at convergence, while the absolute view reports cumulative tokens in millions on a log scale.

Table 2 and Figure 3 show that token consumption is concentrated in the most challenging CVDP categories. The three legacy suites together consume 6.0M tokens, whereas the nine CVDP categories consume 203.9M tokens, accounting for 97.1% of the total. Among these, CID 002 alone uses 56.0M tokens, CID 003 uses 38.0M, and CID 012 uses 32.2M. CID 013 is comparatively efficient given its difficulty: despite its very low first-iteration pass rate, it converges after consuming only 14.2M tokens, consistent with its steady, plateau-free trajectory.
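The aggregate figures quoted above can be re-derived from the per-suite rows of Table 2. The sketch below uses the rounded table values and the reported 209.9M total; the variable names are illustrative. Note that the three legacy rows as printed sum to 6.1M, which is within rounding of the 6.0M quoted in the text.

```python
# Per-suite token consumption in millions, copied from Table 2 (rounded values).
TOKENS_M = {
    "ChipBench": 2.8,
    "RTLLM-2.0": 1.3,
    "Verilog-Eval v2": 2.0,
    "CVDP CID 002": 56.0,
    "CVDP CID 003": 38.0,
    "CVDP CID 004": 23.7,
    "CVDP CID 005": 9.1,
    "CVDP CID 007": 21.6,
    "CVDP CID 012": 32.2,
    "CVDP CID 013": 14.2,
    "CVDP CID 014": 0.3,
    "CVDP CID 016": 8.8,
}
REPORTED_TOTAL_M = 209.9  # total as reported in Table 2

# Sum the nine CVDP rows and express them as a share of the reported total.
cvdp_total = round(sum(v for k, v in TOKENS_M.items() if k.startswith("CVDP")), 1)
cvdp_share = round(100 * cvdp_total / REPORTED_TOTAL_M, 1)

# The three most expensive categories by tokens consumed.
top3 = sorted(TOKENS_M, key=TOKENS_M.get, reverse=True)[:3]

print(cvdp_total, cvdp_share, top3)
```

The three largest categories alone account for well over half of the total, which is the concentration the text describes.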

The practical takeaway is that benchmark completion should be reported together with token consumption. Although HORIZON clears every suite, the final few percentage points on the most difficult categories absorb a disproportionate share of the budget. Consequently, we view token efficiency, rather than final pass rate, as the metric most in need of improvement. Notably, approximately 91% of all tokens are cached input tokens.

4.3 Detailed discussion on test-generation tasks

The test-generation categories deserve a closer look because they expose what HORIZON is and is not optimizing. For the two categories with parsed coverage data (CID 012 stimulus and CID 014 assertion generation), we additionally measure design coverage of the generated tests, reported as the average coverage percentage over designs with parsed coverage logs at each iteration. The crucial point is that HORIZON’s acceptance gate is the CVDP pass condition, not a coverage target: the loop is driven to make the benchmark’s own harness pass, and once a design passes, the gate is satisfied and the loop stops refining it. Coverage is therefore a secondary, observational signal here: it reports how much of the design the passing tests happen to exercise, rather than the objective being maximized.

Table 3: Coverage summary for the CVDP verification-generation categories. CID 012 is test-plan to testbench stimulus generation; CID 014 is test-plan to assertion generation. CID 014 has seven designs without parsed coverage rows at the best iteration, so its best-coverage average is computed over the remaining 60 parsed logs. Footnote a: As in Table 1, Iter. 0 denotes the first agent iteration and should not be interpreted as a standalone model Pass@1 or one-shot generation result; it reflects the repository state after the first step of the agentic evolution process.

Figure 4: Coverage and pass rate for CID 012 (testbench stimulus generation), both truncated at iteration 32, where the pass rate first reaches 100% (the convergence-point convention of Figures 2–3). Left: average design coverage rises gradually from 86.5% to 97.9% while the pass rate climbs from 47.8% to 100% over the same iterations; the two move together but coverage plateaus below 100% because the acceptance gate stops each design once it passes. Right: per-design coverage curves with their mean (bold). The improvement comes from lifting a low-coverage tail up toward full coverage, rather than only nudging already-high-coverage designs.

Table 3 and Figure 4 make the stopping behavior concrete. CID 012 reaches a 100% pass rate at iteration 32, but its average per-design coverage at that point is 97.9%, not 100%. This gap is expected and is a direct consequence of the acceptance gate: the loop halts on each design as soon as the CVDP harness passes, so coverage simply reflects the tests that were sufficient to pass rather than the maximum achievable. The per-design view (Figure 4, right) shows the loop lifting a low-coverage tail: several designs that begin near 20% to 40% coverage are pulled up toward full coverage, rather than only nudging already-high-coverage designs.

CID 014 is the opposite regime: it starts near 98% coverage and saturates immediately, so we report it only in Table 3; its 100.0% best value is computed over the 60 designs with parsed coverage logs and should be read as coverage over those logs, not as evidence that every design emitted a usable report. We emphasize that we did not attempt to drive coverage to 100%: doing so would mean replacing the pass-based acceptance predicate with a coverage target, which HORIZON supports in principle but we leave to future work. Coverage here is thus a diagnostic that the generated tests are substantive, not a claim of exhaustive verification.
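The difference between a pass-based gate and a coverage-target gate can be illustrated with a small sketch. Everything below is hypothetical: `EvalResult`, `pass_based`, `coverage_target`, and `first_accepted_iteration` are illustrative names, not HORIZON's actual interface, and the history is toy data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalResult:
    """Hypothetical per-design evaluation record (not HORIZON's actual schema)."""
    passed: bool               # did the benchmark harness pass?
    coverage: Optional[float]  # parsed coverage in percent, or None if no log


AcceptancePredicate = Callable[[EvalResult], bool]


def pass_based(result: EvalResult) -> bool:
    """Accept as soon as the harness passes, regardless of coverage."""
    return result.passed


def coverage_target(threshold: float) -> AcceptancePredicate:
    """Build a stricter gate: the harness must pass and coverage must reach threshold."""
    def predicate(result: EvalResult) -> bool:
        return (
            result.passed
            and result.coverage is not None
            and result.coverage >= threshold
        )
    return predicate


def first_accepted_iteration(
    history: List[EvalResult], accept: AcceptancePredicate
) -> Optional[int]:
    """Return the first iteration index at which the gate is satisfied, else None."""
    for i, result in enumerate(history):
        if accept(result):
            return i
    return None


if __name__ == "__main__":
    # Toy trajectory for one design: fail, pass at 92% coverage, pass at 100%.
    history = [EvalResult(False, 40.0), EvalResult(True, 92.0), EvalResult(True, 100.0)]
    print(first_accepted_iteration(history, pass_based))
    print(first_accepted_iteration(history, coverage_target(100.0)))
```

In this toy trajectory the pass-based gate stops at the 92% iteration, which mirrors why average coverage can settle below 100% even when every design passes.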

5 Discussion and Limitations

The main takeaway of this work is that repository-managed executable feedback can make many RTL benchmark tasks converge. It is not that agentic hardware design is solved. The results should be read as benchmark convergence under the feedback made available to the agent. Real chip projects involve incomplete specifications, changing constraints, downstream integration, human review, PPA tradeoffs, and validation targets that are not fully represented by current RTL benchmarks Liu et al. (2023); Lu et al. (2024); Liu et al. (2025); Pinckney et al. (2025).

The most important limitation is the reward-feedback interface. In the current HORIZON setup, the agent can inspect the outputs of each iterative evaluation. In CVDP Pinckney et al. (2025), for example, this includes simulator messages, evaluation logs, failure traces, and other task-local artifacts exposed by the benchmark harness. This mirrors a realistic debug workflow: engineers normally inspect logs and counterexamples, and rich feedback is what makes long-horizon repair feasible. At the same time, full access to these signals can create an over-solving or reward-hacking failure mode. The agent may customize the generated RTL to match the observed failures, deterministic tests, or evaluator idiosyncrasies rather than implement the intended design semantics robustly. A passing result can therefore mean “satisfies the visible harness under the exposed traces” rather than “satisfies the specification under all reasonable tests.” This risk is especially relevant when benchmarks reveal detailed failure information or when the final acceptance test is the same harness used throughout the repair loop.

This raises a broader benchmarking issue. Existing RTL-agent benchmarks generally do not include a mechanism to detect over-solving or reward hacking Liu et al. (2023); Lu et al. (2024); Liu et al. (2025); Pinckney et al. (2025). They measure whether the submitted artifact passes the provided harness, but they usually do not separate debugging feedback from final hidden scoring, test robustness under randomized or perturbed stimuli, or audit whether a solution has specialized to artifacts of the evaluator. This is an open problem for the community. Future benchmarks for agentic hardware design should consider a two-level protocol: expose useful diagnostic feedback during repair, but reserve hidden randomized tests, independent reference models, formal equivalence checks, property suites, or held-out simulator configurations for final scoring. Reporting robustness to harness perturbations, coverage closure, and traces of what feedback the agent consumed would also make it easier to distinguish genuine design repair from benchmark-specific adaptation.
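One way to picture the two-level protocol is an evaluator that returns detailed diagnostics only for a visible test set and a single pass/fail bit for final scoring. The sketch below is a toy under stated assumptions: designs are plain Python callables standing in for simulated RTL, the 8-bit incrementer specification is invented for illustration, and `TwoLevelEvaluator` is a hypothetical name rather than part of any existing benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "design" is any callable under test; a real flow would run a simulator.
Design = Callable[[int], int]
Test = Callable[[Design], bool]


def spec(x: int) -> int:
    """Toy reference model: 8-bit incrementer with wraparound."""
    return (x + 1) & 0xFF


def make_test(x: int) -> Test:
    """Build a test that compares the design against the reference at input x."""
    return lambda design: design(x) == spec(x)


@dataclass
class TwoLevelEvaluator:
    visible_tests: Dict[str, Test]  # exposed during repair, with diagnostics
    hidden_tests: Dict[str, Test]   # reserved for final scoring only

    def diagnose(self, design: Design) -> List[str]:
        """Repair-time feedback: names of failing visible tests."""
        return [name for name, test in self.visible_tests.items() if not test(design)]

    def final_score(self, design: Design) -> bool:
        """Final scoring: one pass/fail bit over visible and hidden tests."""
        all_tests = {**self.visible_tests, **self.hidden_tests}
        return all(test(design) for test in all_tests.values())


def overfit(x: int) -> int:
    """A design specialized to the visible inputs only."""
    return {0: 1, 1: 2}.get(x, 0)


evaluator = TwoLevelEvaluator(
    visible_tests={f"in_{x}": make_test(x) for x in (0, 1)},
    hidden_tests={f"in_{x}": make_test(x) for x in (7, 255)},
)

if __name__ == "__main__":
    print(evaluator.diagnose(overfit))     # no visible failures
    print(evaluator.final_score(overfit))  # rejected by the held-out input 7
    print(evaluator.final_score(spec))     # the reference passes everything
```

The overfit design clears every visible test yet fails final scoring, which is the kind of benchmark-specific adaptation a held-out stage is meant to surface.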

This tension has a direct parallel in software-engineering benchmarks. SWE-bench addresses it through structural test withholding Jimenez et al. (2024); Wang et al. (2026). Agents receive a GitHub issue description and a repository snapshot at a fixed commit, while the fail-to-pass and pass-to-pass tests used for evaluation are withheld during inference and executed only after a final patch is submitted. This separation between repair-time information and evaluation-time scoring reduces opportunities for reward hacking and benchmark-specific adaptation. Subsequent analyses have shown that benchmark design choices such as solution leakage and weak test suites can substantially inflate measured performance Aleithan et al. (2024), and that patches deemed successful by benchmark tests may still diverge from developer-intended behavior or contain latent correctness issues pengfeigao1 (2024). These findings suggest that future RTL-agent benchmarks should similarly separate diagnostic feedback from final scoring and incorporate robustness checks beyond the visible repair loop, such as hidden randomized tests, independent reference models, formal equivalence checking, or held-out verification environments.

Another major limitation is feedback turnaround. The RTL pass/fail benchmarks in this paper are relatively favorable because most evaluations complete quickly enough for iterative repair. In broader chip-design self-evolution, reward evaluation can be far slower. We have also studied PPA optimization in RTL design loops and PPA-oriented EDA-tool evolution, including the ABCEvo-style setting Yu et al. (2026), where the reward may require synthesis, placement, routing, timing analysis, power estimation, or large regression suites. SATLUTION already illustrates the cost of accurate repository-level reward: evaluating an entire SAT-competition benchmark required roughly a two-hour turnaround even with about 800 nodes running in parallel Yu et al. (2025). For RTL PPA optimization or EDA-tool self-evolution, the turnaround can grow to days or weeks depending on design size, evaluation stage, and signoff fidelity. Long-latency reward fundamentally changes the agentic system problem: naive edit-evaluate-repair loops become too slow, and the agent must reason under delayed, sparse, and expensive feedback. Addressing long-turnaround reward is therefore a key research challenge for agentic chip design.
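A back-of-the-envelope calculation shows why turnaround dominates. The sketch below takes the approximate SATLUTION figures quoted above (about two hours on about 800 nodes) and applies them to a hypothetical strictly sequential loop of 30 evaluations; the loop length and the one-day turnaround are illustrative assumptions, not reported numbers.

```python
def node_hours(turnaround_hours: float, nodes: int) -> float:
    """Compute cost of one evaluation, assuming all nodes are busy throughout."""
    return turnaround_hours * nodes


def sequential_wall_clock_hours(evaluations: int, turnaround_hours: float) -> float:
    """Wall-clock time of a strictly sequential edit-evaluate-repair loop.

    Agent edit time is ignored, so this is an optimistic lower bound.
    """
    return evaluations * turnaround_hours


if __name__ == "__main__":
    # Approximate figures from the text: ~2 hours on ~800 nodes per evaluation.
    print(node_hours(2.0, 800))
    # Hypothetical loop of 30 evaluations at a 2-hour and a 1-day turnaround.
    print(sequential_wall_clock_hours(30, 2.0))
    print(sequential_wall_clock_hours(30, 24.0))
```

Under these assumptions a single evaluation costs about 1,600 node-hours, and a 30-step loop takes 60 hours at a two-hour turnaround but 30 days at a one-day turnaround.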

6 Conclusion

We presented HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A human-written Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated repository worktree until the acceptance criterion is satisfied. Building on prior repository-scale self-evolution systems for EDA software, HORIZON extends the same paradigm to hardware-design artifacts themselves.

Across ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, HORIZON achieves 100% benchmark completion with a fully hands-free agentic loop. To our knowledge, this is the first agentic system to complete all evaluated RTL benchmark suites end-to-end without human intervention. With correctness largely saturated on these benchmarks, the more informative signal becomes the cost of reaching that outcome. We find that token consumption is concentrated in a small number of difficult categories and that approximately 91% of all tokens are cached input tokens, making token efficiency a more meaningful target for future improvement than final pass rate.

At the same time, we do not claim that agentic hardware design is solved. Current RTL benchmarks are controlled proxies for a much broader engineering problem and leave open important questions around reward hacking, robustness, long-horizon design planning, and deployment in production design flows. We hope this work helps establish a path from benchmark-scale RTL generation toward reliable agentic systems for real-world chip design.

References

Novikov et al. (2025) Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, et al. AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery. arXiv preprint arXiv:2506.13131, 2025.
Yu et al. (2025) Cunxi Yu, Rongjian Liang, Chia-Tung Ho, and Haoxing Ren. Autonomous Code Evolution Meets NP-Completeness. arXiv preprint arXiv:2509.07367, 2025.
Yu et al. (2026) Cunxi Yu, Rongjian Liang, Chia-Tung Ho, and Haoxing Ren. Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC. arXiv preprint arXiv:2604.15082, 2026.
Pinckney et al. (2025) Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification. arXiv preprint arXiv:2506.14074, 2025.
Deng et al. (2026) Chenhui Deng, Zhongzhi Yu, Guan-Ting Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs. arXiv preprint arXiv:2602.10218, 2026.
Deng et al. (2025) Chenhui Deng, Yun-Da Tsai, Guan-Ting Liu, Zhongzhi Yu, and Haoxing Ren. ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation. arXiv preprint arXiv:2506.05566, 2025.
Liu et al. (2023) Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023. arXiv:2309.07544.
Lu et al. (2024) Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2024. arXiv:2308.05345.
Liu et al. (2025) Shang Liu, Yao Lu, Wenji Fang, Mengming Li, and Zhiyao Xie. OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2024. arXiv:2503.15112.
Thakur et al. (2024) Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. VeriGen: A Large Language Model for Verilog Code Generation. ACM Transactions on Design Automation of Electronic Systems, 2024. arXiv:2308.00708.
Liu et al. (2025) Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2025. arXiv:2312.08617.
Liu et al. (2023) Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, et al. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv preprint arXiv:2311.00176, 2023.
Cui et al. (2024) Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, et al. OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection. arXiv preprint arXiv:2407.16237, 2024.
Liu et al. (2024) Mingjie Liu, Yun-Da Tsai, Wenfei Zhou, and Haoxing Ren. CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair. arXiv preprint arXiv:2409.12993, 2024.
Thakur et al. (2023) Shailja Thakur, Jason Blocklove, Hammond Pearce, Benjamin Tan, Siddharth Garg, and Ramesh Karri. AutoChip: Automating HDL Generation Using LLM Feedback. arXiv preprint arXiv:2311.04887, 2023.
Tsai et al. (2024) Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC), 2024. arXiv:2311.16543.
Ho et al. (2025) Chia-Tung Ho, Haoxing Ren, and Brucek Khailany. VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025. arXiv:2408.08927.
Zhao et al. (2025) Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, and Jishen Zhao. MAGE: A Multi-Agent Engine for Automated RTL Code Generation. In Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC), 2025. arXiv:2412.07822.
Huang et al. (2025) Beichen Huang, Ran Cheng, and Kay Chen Tan. EvoGit: Decentralized Code Evolution via Git-Based Multi-Agent Collaboration. arXiv preprint arXiv:2506.02049, 2025.
Wu et al. (2025) Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Git Context Controller: Manage the Context of LLM-based Agents like Git. arXiv preprint arXiv:2508.00031, 2025.
Jimenez et al. (2024) Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024. arXiv:2310.06770.
Aleithan et al. (2024) Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. SWE-Bench+: Enhanced Coding Benchmark for LLMs. arXiv preprint arXiv:2410.06992, 2024.
Wang et al. (2026) You Wang, Michael Pradel, and Zhongxin Liu. Are “Solved Issues” in SWE-bench Really Solved Correctly? An Empirical Study. In Proceedings of the International Conference on Software Engineering (ICSE), 2026. Preprint available as arXiv:2503.15223.
pengfeigao1 (2024) pengfeigao1, “Whether using test patch is allowed,” SWE-bench/experiments, GitHub issue #16, Jun. 7, 2024. Accessed: Jun. 23, 2026. [Online]. Available: https://github.com/SWE-bench/experiments/issues/16