StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Summary
StepPRM-RTL is a novel framework combining stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to improve LLM-based RTL code generation for Verilog and VHDL, outperforming prior methods by over 10% in functional correctness metrics.
# StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Source: [https://arxiv.org/html/2606.04246](https://arxiv.org/html/2606.04246)
###### Abstract\.
Automatic generation of RTL code for digital hardware designs remains challenging due to long\-horizon reasoning, multi\-step dependencies, and strict correctness constraints in Verilog and VHDL\. We present StepPRM\-RTL, a novel framework that combines stepwise trajectory modeling, process\-reward modeling \(PRM\), and retrieval\-augmented fine\-tuning \(RAFT\) to enhance both the functional correctness and reasoning fidelity of LLM\-based RTL code generation\. StepPRM\-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification\. A Process Reward Model \(PRM\) evaluates intermediate steps, providing dense feedback that guides reinforcement\-style updates during RAFT fine\-tuning\. Monte Carlo Tree Search \(MCTS\) explores alternative reasoning paths, enriching the training dataset with high\-quality trajectories\. This integration of stepwise and outcome\-aware rewards allows the model to learn both how and why to construct correct RTL, improving long\-horizon reasoning beyond standard supervised or outcome\-based training\. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM\-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics\. Ablation studies confirm that the combination of PRM\-guided rewards and stepwise trajectory exploration is key to its performance\. StepPRM\-RTL generalizes across RTL languages and provides a scalable framework for high\-fidelity, interpretable code generation, establishing a new standard for LLM\-assisted hardware design automation\.
RTL code generation, Verilog, VHDL, large language models, reinforcement learning, process reward modeling, stepwise reasoning, Monte Carlo Tree Search, MCTS, LLM, RL, RAFT, retrieval\-augmented fine\-tuning, hardware design automation\.
63rd ACM/IEEE Design Automation Conference (DAC '26), July 26–29, 2026, Long Beach, CA, USA. Copyright: CC. DOI: 10.1145/3770743.3804218. ISBN: 979-8-4007-2254-7/2026/07. CCS concepts: Hardware → Hardware description languages and compilation; Computing methodologies → Natural language generation.

## 1. Introduction
Automating Register–Transfer Level \(RTL\) code generation remains a central challenge in Electronic Design Automation \(EDA\)\. Unlike general\-purpose programming, RTL demands not only syntactic correctness but also precise temporal, concurrent, and structural behaviors that govern circuit functionality\. A single misaligned state update or improperly gated enable path can propagate across modules, breaking the datapath despite remaining syntactically valid\. Consequently, generating semantically and functionally correct Verilog/VHDL code is both high\-impact and underexplored, with immediate relevance to industrial design productivity\.
Current RTL generation approaches (Liu et al., [2023a](https://arxiv.org/html/2606.04246#bib.bib11); Blocklove et al., [2023](https://arxiv.org/html/2606.04246#bib.bib4); Lai et al., [2023](https://arxiv.org/html/2606.04246#bib.bib9); Fu et al., [2023](https://arxiv.org/html/2606.04246#bib.bib6); Vijayaraghavan et al., [2024a](https://arxiv.org/html/2606.04246#bib.bib18)) rely primarily on supervised learning over code corpora, capturing surface-level patterns but not the reasoning sequence required to assemble correct control and datapath logic. Outcome-driven methods (Wei et al., [2025](https://arxiv.org/html/2606.04246#bib.bib21); Akyash et al., [2025](https://arxiv.org/html/2606.04246#bib.bib2)) evaluate correctness only at the final design level, offering no supervision for intermediate decisions such as structuring reset logic, aligning control-path transitions, or coordinating enables across always blocks. As a result, these models struggle with long-horizon dependencies and cannot reliably shape multi-step design trajectories.
Recent advances in software code generation attempt to address these issues by introducing process reward models (PRMs) for intermediate-step scoring (Li et al., [2025](https://arxiv.org/html/2606.04246#bib.bib10); Ye et al., [2025](https://arxiv.org/html/2606.04246#bib.bib23)). However, these PRMs operate at the token level, which is fundamentally mismatched to hardware semantics: meaningful RTL decisions often span statements, modules, and signal groups, making token-level credit assignment noisy and unstable. Moreover, structured search techniques such as Monte-Carlo Tree Search (MCTS), widely used in reasoning-intensive domains (Kemmerling et al., [2024](https://arxiv.org/html/2606.04246#bib.bib8); Świechowski et al., [2023](https://arxiv.org/html/2606.04246#bib.bib16)), remain largely unexplored for RTL synthesis.
To address these limitations, we propose StepPRM-RTL, a reasoning-aware RTL generation framework that introduces step-level supervision aligned with hardware semantics. Each reasoning step consists of an interpretable rationale paired with its corresponding code edit, enabling a Process Reward Model (StepPRM) to evaluate choices at the granularity of meaningful RTL behavior. StepPRM further supports *PRM-guided MCTS exploration*, where the generator proposes alternative reasoning paths for the same specification, and MCTS evaluates them using step-level rewards and lightweight synthesizability checks. This produces a diverse set of high-value trajectories that extend beyond supervised decompositions while remaining grounded in verifiable hardware logic. Finally, StepPRM-RTL integrates these trajectories into a *Retrieval-Augmented Fine-Tuning (RAFT)* framework (Zhang et al., [2024](https://arxiv.org/html/2606.04246#bib.bib25)). RAFT retrieves canonical reasoning steps from similar designs and uses StepPRM-based intermediate rewards to stabilize policy refinement. Together, step-level reasoning supervision, structured trajectory exploration, and retrieval-based context form a single coherent training pipeline that enables effective long-horizon RTL code generation.
Figure 1. Overall workflow of StepPRM-RTL: training loop (top) and inference (center).

Figure [1](https://arxiv.org/html/2606.04246#S1.F1) summarizes the workflow: (1) extract canonical stepwise trajectories; (2) expand the reasoning space using StepPRM-guided MCTS; (3) refine StepPRM on the expanded trajectory set; and (4) update the generator using RAFT with step-level rewards. This iterative loop jointly improves both the policy and the reward model while maintaining semantic alignment with RTL design principles. Our contributions are summarized as follows:
Step-Level Process Rewarding for RTL: We introduce StepPRM-RTL, the first framework to define and score semantically meaningful intermediate reasoning steps for HDL, resolving the mismatch between token-level scoring and hardware-level behaviors.
Unified Reasoning Pipeline: We propose an integrated pipeline combining StepPRM-guided MCTS exploration with RAFT-based policy refinement, enabling stable long-horizon credit assignment and retrieval-grounded reasoning.
Comprehensive Evaluation: Extensive experiments on Verilog and VHDL benchmarks demonstrate significant improvements in step-level reasoning quality, pass@k, functional correctness, and generalization over supervised and reward-based baselines.
## 2. Problem Formulation
We study the task of generating functionally correct Register–Transfer Level (RTL) designs from behavioral specifications. Let $x$ denote an input specification (e.g., a natural-language description of module behavior) and let $c^{\star}$ be a corresponding canonical Verilog/VHDL implementation. Instead of treating HDL generation as flat token prediction, we model RTL construction as a sequence of semantically meaningful *design steps*. Each step is represented as $e_t = (r_t, \delta_t)$, where $r_t$ is a natural-language rationale describing a hardware design decision (such as adding synchronous reset or propagating enables), and $\delta_t$ is the code edit applied to the current partial implementation. Applying edits sequentially produces partial designs $c_0, c_1, \ldots, c_T$ with $c_t = \delta_t(c_{t-1})$, where $c_0$ is empty or templated and $c_T$ is the final design. A trajectory is thus $\tau = \langle e_1, \ldots, e_T \rangle$, generated by a policy model $\pi_\theta(e_t \mid x, c_{t-1})$ that conditions on both the specification and the evolving design state.
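The step and trajectory definitions above can be sketched as a small data structure. This is only an illustration: the `Step` class, the `apply_trajectory` helper, and the three-step counter trajectory are hypothetical names and content chosen for the example, and each edit $\delta_t$ is modeled as a plain function from the previous partial design to the next one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One design step e_t = (r_t, delta_t)."""
    rationale: str                 # r_t: natural-language design decision
    edit: Callable[[str], str]     # delta_t: maps c_{t-1} to c_t


def apply_trajectory(steps: List[Step], c0: str = "") -> List[str]:
    """Apply edits in order and return the partial designs [c_0, ..., c_T]."""
    designs = [c0]
    for step in steps:
        designs.append(step.edit(designs[-1]))
    return designs


# Hypothetical trajectory for a small counter (illustrative content only).
trajectory = [
    Step("Declare the module interface.",
         lambda c: c + "module counter(input clk, input rst, output reg [1:0] q);\n"),
    Step("Add a synchronous reset and the increment.",
         lambda c: c + "  always @(posedge clk)\n"
                       "    if (rst) q <= 2'b00; else q <= q + 1'b1;\n"),
    Step("Close the module.",
         lambda c: c + "endmodule\n"),
]

designs = apply_trajectory(trajectory)
final_design = designs[-1]  # c_T
```

Keeping every intermediate design $c_t$ around is what later allows a step-level scorer to inspect the state before and after each edit.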
To supervise intermediate reasoning quality, we define a Step-level Process Reward Model (StepPRM) $V_\phi(e_t, c_{t-1}, x)$ that assigns a semantic score to each step, reflecting structural correctness, consistency with RTL design intent, and alignment with hardware semantics. Final correctness is assessed by an outcome reward $R_{\mathrm{out}}(c_T)$ based on compilation, simulation, and testbench verification. In this work, we leverage two types of datasets: an in-house RTL-IR corpus comprising combinations of spec, code, and summary together with derived stepwise trajectories, used to train both $\pi_\theta$ and $V_\phi$; and Verilog-Eval and VHDL-Eval, two strictly held-out benchmarks used only for evaluation of generalization and functional correctness.
### 2.1. Objective
Our goal is to learn a reasoning-aware policy that produces high-quality trajectories with sound intermediate decisions and correct final implementations. Formally, we maximize the expected trajectory value:

$$\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid x)}\left[\mathcal{V}(\tau)\right],$$

where the value of a trajectory combines step-level and outcome-level rewards:

$$\mathcal{V}(\tau) = \alpha \sum_{t=1}^{T} V_\phi(e_t, c_{t-1}, x) + \beta\, R_{\mathrm{out}}(c_T), \tag{1}$$

with $\alpha$ and $\beta$ weighting semantic reasoning quality and final functionality. This formulation provides dense, hardware-aligned supervision for long-horizon RTL construction.
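As a quick numerical reading of Eq. (1), the sketch below combines per-step scores with an outcome reward. The function name `trajectory_value` and the example scores and weights are illustrative assumptions; the paper does not report values for $\alpha$ and $\beta$.

```python
from typing import Sequence


def trajectory_value(step_scores: Sequence[float],
                     outcome_reward: float,
                     alpha: float,
                     beta: float) -> float:
    """V(tau) = alpha * sum_t V_phi(e_t, c_{t-1}, x) + beta * R_out(c_T)."""
    return alpha * sum(step_scores) + beta * outcome_reward


# Illustrative numbers only: three step scores and a passing testbench (R_out = 1).
value = trajectory_value([0.8, 0.6, 0.9], outcome_reward=1.0, alpha=0.5, beta=2.0)
```

With these made-up numbers the step term contributes 0.5 × 2.3 = 1.15 and the outcome term contributes 2.0, so a trajectory that fails its testbench would lose most of its value even with good intermediate steps.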
## 3. Methodology: StepPRM-RTL Framework
### 3.1. Overview
We propose StepPRM-RTL (short for Step-level Process Reward Model for RTL Synthesis), an RL-guided framework for generating correct RTL designs from natural-language or structured specifications. StepPRM-RTL models RTL generation as a sequence of semantically meaningful design steps and integrates four tightly coupled components: (i) *stepwise trajectory construction* from canonical RTL code to obtain high-quality reasoning demonstrations, (ii) a *Step-level Process Reward Model (StepPRM)* $V_\phi$ that assigns semantic scores to intermediate design decisions, (iii) *PRM-guided Monte-Carlo Tree Search (MCTS)* to explore alternative reasoning paths and collect diverse, high-value trajectories, and (iv) *retrieval-augmented fine-tuning (RAFT)* that refines the generation policy $\pi_\theta$ using retrieved trajectories combined with StepPRM rewards.

The framework operates as an iterative loop: canonical RTL implementations are first decomposed into stepwise trajectories to bootstrap the StepPRM. The initial PRM guides MCTS exploration to generate diverse, high-value reasoning trajectories beyond the canonical examples. These trajectories are then used to refine the PRM, improving its ability to assign semantically meaningful intermediate rewards. Finally, the generation policy $\pi_\theta$ is updated through RAFT fine-tuning, combining retrieved trajectory context with StepPRM scores to reinforce correct intermediate reasoning. This loop of trajectory collection, PRM refinement, and policy updates repeats until convergence, ensuring continuous improvement of both the reward model and the generator. By unifying interpretable step supervision, structured exploration, and reward-guided policy refinement, StepPRM-RTL enhances long-horizon reasoning fidelity and final RTL correctness.
### 3.2. Stepwise Trajectory Construction
Table 1. Stepwise decomposition of a 2-bit counter, pairing each rationale with its code edit.

The first component of StepPRM-RTL is *Stepwise Trajectory Construction*, which decomposes canonical RTL implementations into semantically meaningful intermediate design steps. Given a specification $x$ and its canonical RTL implementation $c^{\star}$ (Verilog or VHDL), we generate a stepwise trajectory $\tau = \langle e_1, e_2, \dots, e_T \rangle$, where $e_t = (s_t, \delta_t)$, with each step $e_t$ pairing a human- or model-generated rationale statement $s_t$ (the rationale written $r_t$ in Section 2) and the corresponding code edit $\delta_t$ applied to a partial implementation $c_{t-1}$. The final partial implementation $c_T$ should reconstruct $c^{\star}$.
In practice, this decomposition leverages large language models (LLMs) to propose rationales and code edits, optionally assisted by abstract syntax tree (AST) analysis to ensure syntactic and structural consistency. This produces high-quality, interpretable demonstrations $(x, \tau)$, which provide supervised training pairs for initializing the generation policy $\pi_\theta$ and bootstrapping the Step-level Process Reward Model (StepPRM). Unlike token-level PRMs, StepPRM learns to assign semantic rewards to entire steps, improving stability and credit assignment in downstream reinforcement learning. Table [1](https://arxiv.org/html/2606.04246#S3.T1) illustrates the decomposition of a simple 2-bit counter design into a stepwise trajectory. Collectively, these trajectories form the initial training pool $\mathcal{D}_0$, used to initialize both StepPRM $V_\phi$ and the supervised policy $\pi_\theta$. This bootstrap phase establishes dense, step-level supervision that enables subsequent iterative exploration and reward-guided policy refinement.
### 3.3. Step-Level Process Reward Model (StepPRM)
Given the bootstrapped trajectory dataset $\mathcal{D}_0 = \{(x, \tau)\}$, the goal of the Step-Level Process Reward Model (StepPRM) is to learn a function $V_\phi : (x, e_{1:t}) \mapsto \mathbb{R}$, which assigns a scalar reward to each intermediate step $e_t$ conditioned on the specification $x$ and the partial reasoning trajectory $e_{1:t} = \langle e_1, \dots, e_t \rangle$. Unlike token-level reward models, StepPRM operates at the *semantic step* granularity, enabling stable credit assignment over long-horizon RTL synthesis trajectories.
#### 3.3.1. Supervised Preference Learning from Canonical Trajectories
Each stepwise trajectory $\tau = \langle e_1, \dots, e_T \rangle$ obtained from canonical decomposition corresponds to a sequence of high-quality intermediate decisions. For StepPRM training, we treat each canonical step $e_t^{\star}$ as preferable to perturbed or low-quality steps $\tilde{e}_t$ generated by the model or obtained via syntactic mutations. For each training instance, we form a pair $\bigl((x, e_{1:t-1}, e_t^{\star}),\, (x, e_{1:t-1}, \tilde{e}_t)\bigr)$, with the preference label $e_t^{\star} \succ \tilde{e}_t$. StepPRM is trained using the standard preference-ranking objective (Bradley–Terry / logistic preference model) widely used in reward modeling and RLHF (Christiano et al., [2017](https://arxiv.org/html/2606.04246#bib.bib5); Ouyang et al., [2022](https://arxiv.org/html/2606.04246#bib.bib15)):

$$\mathcal{L}_{\mathrm{PRM}} = -\mathbb{E}\!\left[\log \sigma\!\left(V_\phi(x, e_{1:t-1}, e_t^{\star}) - V_\phi(x, e_{1:t-1}, \tilde{e}_t)\right)\right],$$

where $\sigma(\cdot)$ is the sigmoid function. This objective encourages StepPRM to assign higher reward to semantically correct steps and penalize structurally invalid or logically incoherent ones.
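The preference objective above can be written out on precomputed scores. This is a minimal sketch in plain Python rather than a training implementation: `preference_loss` and `log_sigmoid` are names chosen for the example, and the inputs are assumed to be scalar StepPRM outputs for the canonical and perturbed step of each pair.

```python
import math
from typing import Sequence, Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(pairs: Sequence[Tuple[float, float]]) -> float:
    """Mean Bradley-Terry loss over (canonical score, perturbed score) pairs."""
    return -sum(log_sigmoid(v_star - v_tilde) for v_star, v_tilde in pairs) / len(pairs)


# Illustrative scores: the loss is small when canonical steps score higher.
well_ranked = preference_loss([(2.0, 0.0), (1.5, -0.5)])
mis_ranked = preference_loss([(0.0, 2.0), (-0.5, 1.5)])
```

When the two scores are equal the per-pair loss is $\log 2$, and it decreases toward zero as the margin in favor of the canonical step grows.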
#### 3.3.2. Reward Shaping via Partial Rollout Consistency
Because RTL code is only verifiable when complete, StepPRM must infer step quality without explicit functional simulation. To address this, we introduce a consistency-based shaping term that leverages the structural alignment between partial implementations $c_t$ and the final canonical code $c^{\star}$. Let $A(c_t, c^{\star})$ denote an alignment score computed via AST tree-edit similarity or structural matching. We define the shaped reward target for step $e_t$ as:

$$y_t = \alpha \cdot \mathbf{1}[e_t \text{ from canonical}] + (1 - \alpha) \cdot A(c_t, c^{\star}),$$

where $\alpha \in [0, 1]$ controls the balance between canonical supervision and structure-aware shaping. Intuitively, the indicator $\mathbf{1}[\cdot]$ provides a strong discrete signal when a step matches the canonical trace, while $A(\cdot, \cdot)$ supplies a continuous proxy for partial correctness when the step is novel or partially aligned. StepPRM is additionally trained to regress to this shaped reward:

$$\mathcal{L}_{\mathrm{shaping}} = \mathbb{E}\!\left[\bigl(V_\phi(x, e_{1:t}) - y_t\bigr)^{2}\right],$$

which aids calibration and provides denser gradients for generalization.
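To make the shaped target concrete, the sketch below computes $y_t$ and the squared-error shaping loss. One substitution should be noted plainly: the alignment score $A$ is approximated here with a line-level `difflib.SequenceMatcher` ratio, which stands in for the AST tree-edit similarity described above and is not the paper's metric. All function names and the toy code strings are assumptions for illustration.

```python
import difflib
from typing import Sequence


def alignment_score(partial: str, canonical: str) -> float:
    """Stand-in for A(c_t, c*): line-level similarity in [0, 1]."""
    matcher = difflib.SequenceMatcher(None, partial.splitlines(), canonical.splitlines())
    return matcher.ratio()


def shaped_target(is_canonical: bool, partial: str, canonical: str, alpha: float) -> float:
    """y_t = alpha * 1[e_t from canonical] + (1 - alpha) * A(c_t, c*)."""
    return alpha * float(is_canonical) + (1.0 - alpha) * alignment_score(partial, canonical)


def shaping_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between StepPRM outputs and shaped targets."""
    return sum((v - y) ** 2 for v, y in zip(predictions, targets)) / len(targets)


canonical_code = "line_a\nline_b\nline_c\nline_d"
partial_code = "line_a\nline_b"          # two of four canonical lines so far
y_novel = shaped_target(False, partial_code, canonical_code, alpha=0.5)
y_final = shaped_target(True, canonical_code, canonical_code, alpha=0.5)
```

A novel step that has reproduced part of the canonical structure still receives a nonzero target through the alignment term, which is the continuous proxy the shaping term is meant to provide.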
#### 3.3.3. Final Training Objective
The full StepPRM loss combines preference learning and shaping:

$$\mathcal{L}_{\mathrm{StepPRM}} = \mathcal{L}_{\mathrm{PRM}} + \lambda_{\mathrm{sh}}\, \mathcal{L}_{\mathrm{shaping}},$$

where $\lambda_{\mathrm{sh}}$ weights the shaping term. This composite objective ensures that StepPRM captures both *relative preference structure* between reasoning steps and *absolute semantic quality* measured by partial structural consistency.
#### 3.3.4. Reward Assignment During Rollouts
At inference or during MCTS-guided exploration, StepPRM assigns a reward $r_t = V_\phi(x, e_{1:t})$ to each newly generated candidate step. These stepwise rewards provide dense, semantically grounded feedback, enabling efficient exploration and mitigating long-horizon credit assignment issues that commonly arise in RTL synthesis tasks.
### 3.4. PRM-Guided MCTS
To explore alternative reasoning paths beyond the canonical trajectories, we employ a PRM-guided Monte Carlo Tree Search (MCTS). In contrast to standard RLHF pipelines that apply rewards only after full-sequence generation, MCTS enables structured, branching exploration over partial RTL implementations. StepPRM provides dense, step-level evaluations that guide tree expansion, similar in spirit to value-guided planning in AlphaZero-style search (Wan et al., [2024](https://arxiv.org/html/2606.04246#bib.bib20)) but adapted to the semantics of RTL synthesis.
#### 3.4.1. Search Tree Structure
For a given specification $x$, MCTS constructs a search tree where each node represents a partial reasoning prefix $e_{1:t}$ and each edge corresponds to an operator-level or statement-level RTL step $e_{t+1}$. Each node maintains

$$\text{Node}(e_{1:t}) = \bigl(N_t,\, Q_t,\, \{a\},\, \{P_t(a)\}\bigr),$$

where $N_t$ is the visit count, $Q_t$ is the accumulated step-value estimate, $\{a\}$ is the set of candidate next steps, and $P_t(a)$ is the policy prior supplied by the current generation model $\pi_\theta(a \mid x, e_{1:t})$.
#### 3.4.2. StepPRM-Guided UCB Score
During tree traversal, MCTS selects the next step by maximizing an Upper Confidence Bound (UCB) objective:

$$a^{\star} = \arg\max_{a}\left[Q_t(a) + c_{\mathrm{uct}} \cdot P_t(a)\, \frac{\sqrt{\sum_{b} N_t(b)}}{1 + N_t(a)}\right],$$

where $c_{\mathrm{uct}}$ controls exploration. Unlike typical MCTS, where $Q_t(a)$ is backed up solely from terminal rewards, we initialize $Q_t(a) \leftarrow V_\phi(x, e_{1:t}, a)$, directly using StepPRM. This provides dense, semantic feedback even for partial code, preventing uninformative plateaus common in long-horizon synthesis tasks.
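The selection rule above can be traced on a toy node. The sketch below is an assumption-laden illustration: `select_action`, the two candidate step names, and the numbers are made up, while the score formula follows the UCB expression term by term.

```python
import math
from typing import Dict


def select_action(q: Dict[str, float],
                  n: Dict[str, int],
                  p: Dict[str, float],
                  c_uct: float) -> str:
    """Pick argmax_a of Q(a) + c_uct * P(a) * sqrt(sum_b N(b)) / (1 + N(a))."""
    sqrt_total = math.sqrt(sum(n.values()))

    def score(a: str) -> float:
        return q[a] + c_uct * p[a] * sqrt_total / (1 + n[a])

    return max(q, key=score)


# Toy node: Q values would be initialized from StepPRM scores of each candidate.
q_values = {"add_sync_reset": 0.6, "add_enable": 0.5}
visits = {"add_sync_reset": 8, "add_enable": 1}
priors = {"add_sync_reset": 0.5, "add_enable": 0.5}

explore_choice = select_action(q_values, visits, priors, c_uct=1.5)
greedy_choice = select_action(q_values, visits, priors, c_uct=0.0)
```

With `c_uct=1.5` the rarely visited candidate wins (0.5 + 1.125 against 0.6 + 0.25), whereas with `c_uct=0.0` the choice falls back to the higher Q value.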
#### 3.4.3. Rollout Expansion and PRM Value Backup
When expanding a leaf node, the partial trajectory is extended using the policy model $\pi_\theta$ until a horizon depth $H$ or an early structural stopping criterion (e.g., balanced begin–end blocks) is met. StepPRM evaluates each new step, and the leaf value is computed as

$$R_{\mathrm{leaf}} = \frac{1}{k} \sum_{i=1}^{k} V_\phi(x, e_{1:t+i}),$$

i.e., the average step-level semantic value across the $k$ rollout steps. This value is backed up through the tree:

$$Q_t(a) \leftarrow \frac{N_t(a) \cdot Q_t(a) + R_{\mathrm{leaf}}}{N_t(a) + 1}, \quad N_t(a) \leftarrow N_t(a) + 1.$$
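A worked instance of the leaf value and the running-mean backup may help. The helper names `leaf_value` and `backup` and the numbers are illustrative assumptions; the arithmetic follows the two update rules above.

```python
from typing import Sequence, Tuple


def leaf_value(rollout_scores: Sequence[float]) -> float:
    """R_leaf: mean StepPRM score over the k rollout steps."""
    return sum(rollout_scores) / len(rollout_scores)


def backup(q: float, n: int, r_leaf: float) -> Tuple[float, int]:
    """Incremental mean update: Q <- (N*Q + R_leaf) / (N + 1), N <- N + 1."""
    return (n * q + r_leaf) / (n + 1), n + 1


# Edge visited 3 times with Q = 0.5; a new 2-step rollout scores 0.9 and 0.7.
r_leaf = leaf_value([0.9, 0.7])
new_q, new_n = backup(q=0.5, n=3, r_leaf=r_leaf)
```

Here the rollout averages 0.8, so the edge value moves from 0.5 to (1.5 + 0.8) / 4 = 0.575 and the visit count becomes 4.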
#### 3.4.4. Balancing Exploration and Structural Feasibility
To prevent exploration of syntactically invalid branches, MCTS performs feasibility checks on partial code $c_t$ (e.g., unmatched always-blocks, undeclared signals, combinational cycles). Branches that violate structural invariants are discarded and assigned a large negative StepPRM penalty via $V_\phi(x, e_{1:t}) \leftarrow -\beta$, where $\beta$ is a large constant. This tightens the search space and improves sample efficiency.
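The penalty mechanism can be illustrated with a deliberately simple check. The sketch below only tests that no `end` appears before its matching `begin` in partial Verilog text, which is far weaker than the checks listed above (it does not detect undeclared signals or combinational cycles). The names `is_structurally_feasible`, `guarded_score`, and the value of `PENALTY` are hypothetical.

```python
import re

PENALTY = 1e3  # hypothetical value for the large constant beta


def is_structurally_feasible(partial_code: str) -> bool:
    """Reject partial code in which an `end` closes a block that was never opened.

    An unclosed `begin` is allowed, since partial designs may still be open.
    """
    depth = 0
    for token in re.findall(r"\b(?:begin|end)\b", partial_code):
        depth += 1 if token == "begin" else -1
        if depth < 0:
            return False
    return True


def guarded_score(partial_code: str, prm_score: float) -> float:
    """Return the StepPRM score, or -beta for structurally infeasible branches."""
    return prm_score if is_structurally_feasible(partial_code) else -PENALTY


open_block = "always @(posedge clk) begin\n  q <= d;"
stray_end = "q <= d;\nend"
```

Because the pattern uses word boundaries, keywords such as `endmodule` or `endcase` are not counted as a block-closing `end`.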
#### 3.4.5. Search Output and Trajectory Aggregation
After $M$ simulations, the improved policy for each state is given by normalized visit counts:

$$\hat{\pi}(a \mid x, e_{1:t}) = \frac{N_t(a)^{\tau}}{\sum_{b} N_t(b)^{\tau}},$$

where $\tau$ is a temperature parameter. The top-ranked rollouts form an expanded dataset $\mathcal{D}_{\mathrm{mcts}} = \{(x, \hat{\tau})\}$, where $\hat{\tau}$ is a high-reward trajectory under StepPRM. These trajectories include novel, semantically consistent reasoning paths that go beyond canonical demonstrations, reducing bootstrap bias and stabilizing subsequent policy refinement (RAFT).
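The visit-count policy can be computed directly from a node's counters. In this sketch the function name `visit_count_policy` and the visit numbers are assumptions, and the exponent is applied exactly as written in the formula above (counts raised to the temperature, then normalized).

```python
from typing import Dict


def visit_count_policy(visits: Dict[str, int], temperature: float) -> Dict[str, float]:
    """pi_hat(a) = N(a)^temperature / sum_b N(b)^temperature."""
    powered = {a: float(n) ** temperature for a, n in visits.items()}
    total = sum(powered.values())
    return {a: v / total for a, v in powered.items()}


# Illustrative counts after a search over three candidate steps.
visits = {"step_a": 30, "step_b": 15, "step_c": 5}
proportional = visit_count_policy(visits, temperature=1.0)
sharpened = visit_count_policy(visits, temperature=2.0)
```

With an exponent of 1 the policy is proportional to the counts (0.6, 0.3, 0.1); a larger exponent concentrates more mass on the most visited step.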
### 3.5. Retrieval-Augmented Fine-Tuning (RAFT)
After StepPRM-guided MCTS expands the trajectory space, the final component of StepPRM-RTL is Retrieval-Augmented Fine-Tuning (RAFT), which refines the generation policy $\pi_\theta$ using (i) retrieved repository-level context, and (ii) high-quality trajectories weighted by StepPRM-derived rewards. RAFT integrates retrieval-based grounding with reward-weighted policy optimization, enabling the policy to internalize both semantic reasoning structure and hardware-specific design patterns.
#### 3.5.1. Retrieval Model for Repository-Level Context
Given a specification $x$, RAFT retrieves relevant RTL files, design patterns, module templates, or prior verified trajectories from a repository $\mathcal{R}$. We encode each repository element $d \in \mathcal{R}$ using a domain-tuned encoder $g(\cdot)$ and compute similarity with the query encoding $q(x)$:

$$s(d \mid x) = \text{sim}(q(x), g(d)).$$

The top-$k$ documents, $\mathcal{C}(x) = \{d_1, \dots, d_k\}$, are retrieved and concatenated with the trajectory prefix for conditioning.
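A minimal sketch of the top-$k$ retrieval step follows, under two stated assumptions: the similarity function is taken to be cosine similarity (the text only says "sim"), and embeddings are given as precomputed toy vectors rather than produced by a real encoder. File names and vectors are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   repo: Dict[str, Sequence[float]],
                   k: int) -> List[str]:
    """Return the k repository items with the highest similarity to the query."""
    ranked = sorted(repo.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings g(d) for three hypothetical repository files, and a query q(x).
repository = {
    "fifo.v": [1.0, 0.0, 0.0],
    "counter.v": [0.9, 0.1, 0.0],
    "uart.vhd": [0.0, 1.0, 0.0],
}
context = retrieve_top_k([1.0, 0.2, 0.0], repository, k=2)
```

The retrieved names would then be resolved to their contents and prepended to the trajectory prefix as conditioning context.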
Table 2. Overall performance of StepPRM-RTL compared to baselines. StepPRM-RTL achieves the highest Pass@1 and reasoning fidelity on both Verilog and VHDL benchmarks, demonstrating superior functional correctness and stepwise reasoning quality.
#### 3.5.2. Reward-Weighted Trajectories
For each high-value MCTS trajectory $\hat{\tau} = \langle e_1, \dots, e_T \rangle$, StepPRM provides stepwise rewards $r_t = V_\phi(x, e_{1:t})$. We compute a normalized trajectory-level weight:

$$w(\hat{\tau}) = \frac{\exp\!\left(\beta \sum_{t=1}^{T} r_t\right)}{\sum_{\tau' \in \mathcal{D}} \exp\!\left(\beta \sum_{t} r'_t\right)},$$

where $\beta$ controls reward sensitivity. This weighting mechanism resembles advantage-weighted regression in RL fine-tuning and preference-based policy optimization, but adapted to step-level rewards and trajectory supervision.
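The trajectory weight is a softmax over summed step rewards, which the sketch below computes with the usual max-subtraction for numerical stability. The function name `trajectory_weights` and the reward values are illustrative assumptions.

```python
import math
from typing import List, Sequence


def trajectory_weights(step_rewards: Sequence[Sequence[float]], beta: float) -> List[float]:
    """w(tau) = exp(beta * sum_t r_t) / sum_tau' exp(beta * sum_t r'_t)."""
    totals = [beta * sum(rewards) for rewards in step_rewards]
    shift = max(totals)                      # subtract the max before exponentiating
    exps = [math.exp(total - shift) for total in totals]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


# Three illustrative trajectories with per-step StepPRM rewards.
rewards = [[0.9, 0.8, 0.9], [0.5, 0.6, 0.4], [0.9, 0.2]]
weights = trajectory_weights(rewards, beta=1.0)
uniform = trajectory_weights(rewards, beta=0.0)
```

Since the rewards are summed rather than averaged, longer trajectories with positive step rewards tend to receive larger weights; setting `beta` to zero removes the reward signal and yields uniform weights.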
#### 3.5.3. Policy Update
The policy is fine-tuned to maximize the likelihood of high-value trajectories given the retrieved context:

$$\mathcal{L}_{\mathrm{RAFT}} = -\mathbb{E}_{(x, \hat{\tau}) \sim \mathcal{D}_{\mathrm{mcts}}}\left[w(\hat{\tau}) \sum_{t=1}^{T} \log \pi_\theta(e_t \mid x, \mathcal{C}(x), e_{1:t-1})\right].$$

This objective encourages the model to reproduce high-reward decision sequences while grounding them in repository-level retrieved signals. Compared to standard supervised fine-tuning, RAFT introduces two key improvements: (a) retrieval grounding, which exposes the policy to reusable structural patterns and contextually relevant design idioms drawn from existing repositories; and (b) reward weighting, which prioritizes trajectories that StepPRM and MCTS jointly deem semantically consistent and structurally valid.
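The RAFT objective reduces to a weighted negative log-likelihood once the per-step log-probabilities are available. The sketch below assumes those log-probabilities have already been computed by the policy with the retrieved context in its prompt; `raft_loss` and the numbers are hypothetical, and the expectation is approximated by a batch mean.

```python
import math
from typing import Sequence, Tuple


def raft_loss(batch: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Batch estimate of -E[ w(tau) * sum_t log pi(e_t | x, C(x), e_{1:t-1}) ].

    Each item is (trajectory weight, list of per-step log-probabilities).
    """
    return -sum(w * sum(step_logps) for w, step_logps in batch) / len(batch)


# Two illustrative trajectories: weights from StepPRM, log-probs from the policy.
batch = [
    (0.75, [math.log(0.5), math.log(0.5)]),
    (0.25, [math.log(0.25)]),
]
loss = raft_loss(batch)
```

In an actual training loop the log-probabilities would be differentiable tensors, so minimizing this quantity pushes probability mass toward the highly weighted trajectories.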
#### 3.5.4. Iterative Integration with StepPRM and MCTS
RAFT closes the StepPRM-RTL loop. After each RAFT update, $\pi_\theta^{(k+1)} \leftarrow \text{RAFT}\bigl(\pi_\theta^{(k)}\bigr)$, the improved policy becomes the proposal distribution for the next iteration. This yields higher-quality search rollouts, which in turn allow StepPRM to refine reward estimates from a more diverse and semantically richer trajectory distribution.
### 3.6. Implementation Details
We implement StepPRM-RTL in PyTorch (Imambi et al., [2021](https://arxiv.org/html/2606.04246#bib.bib7)), fine-tuning Qwen3-8B-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.04246#bib.bib22)) on stepwise trajectories using 2–4 NVIDIA A100 GPUs. StepPRM takes as input the concatenation of the spec, partial code state, and current step (rationale + edit), encoded via a transformer with a scalar regression head. Retrieval uses Qwen3-Embedding-4B (Zhang et al., [2025](https://arxiv.org/html/2606.04246#bib.bib26)) trained via contrastive learning on HDL repositories, with top-$k$ matches prepended during RAFT fine-tuning. Structured exploration is performed via MCTS guided by StepPRM, using 50 simulations per specification, an exploration constant $c_{\mathrm{uct}} = 1.5$, and a rollout horizon of 10 steps. StepPRM rewards are combined with a structural alignment term ($\lambda_{\mathrm{sh}} = 0.5$) for reward shaping. Outcome verification employs Icarus Verilog for Verilog and GHDL+VUnit for VHDL, though StepPRM scores primarily guide MCTS. The training pipeline first pretrains the policy and StepPRM on canonical trajectories, then expands the trajectory space via StepPRM-guided MCTS, and finally refines the policy with reward-weighted RAFT using retrieved context, iteratively improving both policy and reward model.
## 4. Experiments
We evaluate StepPRM-RTL on RTL synthesis using two benchmarks: Verilog-Eval (Liu et al., [2023b](https://arxiv.org/html/2606.04246#bib.bib12)) (156 spec-to-Verilog tasks with self-checking testbenches from HDLBits) and VHDL-Eval (Vijayaraghavan et al., [2024b](https://arxiv.org/html/2606.04246#bib.bib19)) (202 translated VHDL tasks with similar verification). We augment both with LLM-generated stepwise rationales validated via intermediate checks, enabling evaluation of correctness and reasoning. We compare against finetuned baselines: VeriThoughts (Yubeaton et al., [2025](https://arxiv.org/html/2606.04246#bib.bib24)), Verigen (Thakur et al., [2024](https://arxiv.org/html/2606.04246#bib.bib17)), RTLCoder (Liu et al., [2024](https://arxiv.org/html/2606.04246#bib.bib13)), CodeV (Zhao et al., [2024](https://arxiv.org/html/2606.04246#bib.bib27)), and the VHDL baseline CoDes (Vijayaraghavan et al., [2024a](https://arxiv.org/html/2606.04246#bib.bib18)). We also evaluate strong RAG-enabled LLM baselines: GPT-4o (OpenAI, [2024](https://arxiv.org/html/2606.04246#bib.bib14)) and Qwen3-8B (RAG, no PRM/MCTS) (Bai et al., [2023](https://arxiv.org/html/2606.04246#bib.bib3)). All models are evaluated on two primary metrics: *Pass@1*, computed using the official testbenches, and *reasoning fidelity*, measured by an LLM judge comparing generated reasoning trajectories against canonical reasoning steps in each benchmark. Our experiments address two research questions:

RQ1: How does StepPRM-RTL compare to state-of-the-art baselines on Verilog/VHDL synthesis?

RQ2: What is the contribution of each pipeline component?
## 5. Results
### 5.1. RQ1: Overall Results
StepPRM-RTL achieves the highest Pass@1 and reasoning fidelity on both Verilog and VHDL benchmarks, as shown in Table [2](https://arxiv.org/html/2606.04246#S3.T2). Pass@1 measures the probability that the first generated implementation passes the functional testbench, while reasoning fidelity quantifies how closely the model's intermediate rationales align with ground-truth stepwise reasoning in the benchmark. StepPRM-RTL consistently outperforms prompt-based and finetuning-based baselines, with 0.857 and 0.786 Pass@1 on Verilog and VHDL, respectively, and reasoning fidelity exceeding 80%. Compared to RAG-FT, StepPRM-RTL leverages dense StepPRM rewards and MCTS exploration to improve intermediate reasoning and final correctness. Prompt-based models (Vanilla, CoDes) lag because they receive no trajectory supervision, while finetuned models show moderate gains, underscoring the value of reward-guided trajectory learning.
### 5\.2\.RQ2: Ablation Studies
To quantify the contribution of each StepPRM\-RTL component, we conduct ablation experiments on MCTS, the StepPRM, and reward\-based RAFT\. Results are reported in Table[2](https://arxiv.org/html/2606.04246#S3.T2)\.
Impact of MCTS SearchWe disable PRM\-guided MCTS and replace it with sampling\-only rollouts, generatingK=20K=20candidate trajectories per specification\. These trajectories are fed into RAFT fine\-tuning, retaining StepPRM\-based reward weighting\. Without structured MCTS, Pass@1 decreases from 0\.857 to 0\.810 on Verilog and 0\.786 to 0\.738 on VHDL \(≈\\approx4\.7–5\.0 pp drop\), while reasoning fidelity drops by 4–4\.5 pp\. This confirms that MCTS is critical for selecting high\-quality intermediate steps, reducing invalid rollouts, and effectively exploring diverse reasoning paths that naive sampling cannot cover\.
Outcome vs. Process Rewards. To isolate the effect of the StepPRM, we remove step-level reward supervision and rely solely on outcome-based verification (i.e., functional correctness checked via Icarus Verilog for Verilog, GHDL/VUnit for VHDL). The same MCTS and RAFT pipeline is retained. Removing the PRM leads to Pass@1 dropping from 0.857 to 0.781 (≈7.6 pp) on Verilog and from 0.786 to 0.709 (≈7.7 pp) on VHDL, with reasoning fidelity falling from 82.5% to 73.1% and from 80.2% to 70.8%, respectively. These results indicate that outcome-only rewards, even with simulator/formal verification, provide sparse feedback insufficient for guiding intermediate step-level reasoning. StepPRM supplies dense, interpretable rewards, improving both long-horizon reasoning and trajectory quality.
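The contrast between sparse and dense feedback can be made concrete with a toy sketch. The way step scores and the final outcome are combined here (adding an outcome bonus to the last step) is an assumption for illustration only, and all function names and the `threshold` value are hypothetical.

```python
def outcome_only_rewards(num_steps: int, passed: bool) -> list[float]:
    """Sparse signal: only the final step carries the testbench verdict."""
    return [0.0] * (num_steps - 1) + [1.0 if passed else 0.0]


def process_rewards(step_scores: list[float], passed: bool,
                    outcome_weight: float = 1.0) -> list[float]:
    """Dense signal: every step keeps its PRM score; the last step also
    receives an outcome bonus (assumed combination, not from the paper)."""
    rewards = list(step_scores)
    rewards[-1] += outcome_weight * (1.0 if passed else 0.0)
    return rewards


def first_weak_step(step_scores: list[float], threshold: float = 0.5):
    """Index of the first step scored below the threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


# A failing four-step trajectory with illustrative (made-up) step scores.
scores = [0.9, 0.8, 0.1, 0.2]
sparse = outcome_only_rewards(len(scores), passed=False)
dense = process_rewards(scores, passed=False)
```

For this failing trajectory the sparse signal is all zeros and says nothing about where the design went wrong, whereas the dense signal localizes the first weak step.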
Influence of Reward-Based RAFT (Supervised-Only RAFT). We also evaluate RAFT fine-tuning without reward weighting, i.e., treating all high-value trajectories equally regardless of StepPRM scores. In this configuration, Pass@1 drops from 0.857 to 0.796 on Verilog and 0.786 to 0.721 on VHDL (≈6 pp), while reasoning fidelity decreases by 7–8 pp. This demonstrates that reward-guided RAFT is necessary to prioritize semantically high-quality steps and not just reproduce trajectory sequences.

Thus, these ablation studies demonstrate that each StepPRM-RTL component is crucial: MCTS enables structured exploration and reduces invalid rollouts, PRM provides dense step-level rewards for intermediate reasoning, and reward-weighted RAFT consolidates high-quality trajectories into the policy. Removing any component degrades both functional correctness and reasoning fidelity, validating the design decisions of our framework.
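The percentage-point drops quoted in the three ablations can be recomputed directly from the Pass@1 values reported in the text. The short script below does only that arithmetic; the dictionary keys are descriptive labels chosen here, not names from the paper.

```python
# Pass@1 values as reported in the ablation paragraphs above.
FULL = {"Verilog": 0.857, "VHDL": 0.786}
ABLATIONS = {
    "no MCTS (sampling-only rollouts)": {"Verilog": 0.810, "VHDL": 0.738},
    "no StepPRM (outcome-only rewards)": {"Verilog": 0.781, "VHDL": 0.709},
    "no reward weighting (supervised-only RAFT)": {"Verilog": 0.796, "VHDL": 0.721},
}


def drop_pp(full: float, ablated: float) -> float:
    """Absolute Pass@1 drop in percentage points, rounded to one decimal."""
    return round((full - ablated) * 100, 1)


drops = {
    name: {bench: drop_pp(FULL[bench], scores[bench]) for bench in FULL}
    for name, scores in ABLATIONS.items()
}

for name, per_bench in drops.items():
    print(f"{name}: Verilog -{per_bench['Verilog']} pp, VHDL -{per_bench['VHDL']} pp")
```

On these numbers, removing the StepPRM produces the largest drop on both benchmarks, followed by removing reward weighting and then removing MCTS.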
### 5.3. Hyperparameter Sensitivity Analysis
We analyze the impact of two critical hyperparameters on StepPRM-RTL performance: the number of MCTS simulations per specification ($N_{\text{sim}}$) and the reward shaping weight ($\lambda_{\text{sh}}$). Figure [2](https://arxiv.org/html/2606.04246#S5.F2) shows Pass@1 results on the Verilog and VHDL benchmarks for both hyperparameters.

MCTS Simulation Count: Increasing $N_{\text{sim}}$ improves Pass@1, which rises from $0.78$ to $0.857$ for Verilog and from $0.72$ to $0.786$ for VHDL as simulations increase from $5$ to $25$. Notably, $N_{\text{sim}}=15$ achieves nearly the same performance as $20$–$25$ simulations, offering a favorable tradeoff between accuracy and computational cost. Gains plateau beyond $15$ simulations, suggesting that StepPRM effectively prioritizes high-value steps.

Reward Shaping Weight: Pass@1 peaks at $\lambda_{\text{sh}}=0.3$, striking a balance between canonical step preference and structural alignment. Lower values underweight structural guidance, while higher values ($\lambda_{\text{sh}} \geq 0.5$) overemphasize alignment, occasionally penalizing creative yet correct steps. The trend is consistent across the Verilog and VHDL benchmarks, indicating robustness across both RTL languages.

Overall, StepPRM-RTL performance is stable for moderate MCTS simulations and shaping weights, providing a practical tradeoff between exploration, step-level guidance, and computational cost.
Figure 2. Hyperparameter sensitivity for StepPRM-RTL. Left: Pass@1 vs. MCTS simulations ($N_{\text{sim}}$). Right: Pass@1 vs. shaping weight ($\lambda_{\text{sh}}$). Best performance at $N_{\text{sim}}=15$ and $\lambda_{\text{sh}}=0.3$.
## 6. Conclusion
We introduced StepPRM-RTL, an RL-guided framework for RTL synthesis that integrates stepwise trajectory decomposition, a Step-Level Process Reward Model (StepPRM), PRM-guided MCTS, and retrieval-augmented fine-tuning (RAFT). By modeling RTL generation as a sequence of semantically meaningful steps with dense, interpretable rewards, StepPRM-RTL addresses long-horizon credit assignment and improves both intermediate reasoning fidelity and final functional correctness. Experiments on the Verilog-Eval and VHDL-Eval benchmarks show that StepPRM-RTL outperforms prompt-based, fine-tuned, and retrieval-augmented LLM baselines, achieving state-of-the-art Pass@1 and reasoning fidelity. Ablation studies highlight the critical role of structured MCTS exploration, step-level rewards, and reward-weighted RAFT in improving trajectory quality and reasoning. Future directions include extending the framework to multi-file hierarchical designs, integrating formal verification more tightly into the reward model, and exploring cross-architecture transfer of reasoning trajectories. Overall, StepPRM-RTL bridges interpretable stepwise reasoning with improved RTL synthesis, providing a promising foundation for AI-assisted hardware design.
## References
- Akyash et al. (2025) Mohammad Akyash, Kimia Azar, and Hadi Kamali. 2025. Rtl++: Graph-enhanced llm for rtl code generation. *arXiv preprint arXiv:2505.13479* (2025).
- Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. *arXiv preprint arXiv:2309.16609* (2023).
- Blocklove et al. (2023) Jason Blocklove, Siddharth Garg, Ramesh Karri, and Hammond Pearce. 2023. Chip-Chat: Challenges and Opportunities in Conversational Hardware Design. In *2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD)*. IEEE, 1–6.
- Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. *Advances in neural information processing systems* 30 (2017).
- Fu et al. (2023) Yonggan Fu, Yongan Zhang, Zhongzhi Yu, Sixu Li, Zhifan Ye, Chaojian Li, Cheng Wan, and Yingyan Celine Lin. 2023. GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models. In *2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*. IEEE.
- Imambi et al. (2021) Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. 2021. PyTorch. In *Programming with TensorFlow: solution for edge computing applications*. Springer, 87–104.
- Kemmerling et al. (2024) Marco Kemmerling, Daniel Lütticke, and Robert H Schmitt. 2024. Beyond games: a systematic review of neural Monte Carlo tree search applications. *Applied Intelligence* 54, 1 (2024), 1020–1046.
- Lai et al. (2023) Yao Lai, Jinxin Liu, Zhentao Tang, Bin Wang, Jianye Hao, and Ping Luo. 2023. ChiPFormer: transferable chip placement via offline decision transformer. In *Proceedings of the 40th International Conference on Machine Learning (ICML)*, Vol. 202. PMLR, 18346–18364.
- Li et al. (2025) Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. 2025. CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation. In *Findings of the Association for Computational Linguistics: ACL 2025*.
- Liu et al. (2023a) Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, et al. 2023a. Chipnemo: Domain-adapted llms for chip design. *arXiv preprint arXiv:2311.00176* (2023).
- Liu et al. (2023b) Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. 2023b. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. In *2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*. IEEE.
- Liu et al. (2024) Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2024. RTLCoder: Fully open-source and efficient LLM-assisted RTL code generation technique. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* (2024).
- OpenAI (2024) OpenAI. 2024. ChatGPT. [https://chatgpt.com/](https://chatgpt.com/).
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. *Advances in neural information processing systems* 35 (2022), 27730–27744.
- Świechowski et al. (2023) Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. 2023. Monte Carlo tree search: A review of recent modifications and applications. *Artificial Intelligence Review* 56, 3 (2023), 2497–2562.
- Thakur et al. (2024) Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2024. VeriGen: A Large Language Model for Verilog Code Generation. *ACM Transactions on Design Automation of Electronic Systems (TODAES)* 29, 3 (2024), 1–31.
- Vijayaraghavan et al. (2024a) Prashanth Vijayaraghavan, Apoorva Nitsure, Charles Mackin, Luyao Shi, Stefano Ambrogio, Arvind Haran, Viresh Paruthi, Ali Elzein, Dan Coops, David Beymer, et al. 2024a. Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization. In *Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD*. 1–10.
- Vijayaraghavan et al. (2024b) Prashanth Vijayaraghavan, Luyao Shi, Stefano Ambrogio, Charles Mackin, Apoorva Nitsure, David Beymer, and Ehsan Degan. 2024b. VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation. *arXiv preprint arXiv:2406.04379* (2024).
- Wan et al. (2024) Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus Mcaleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training. In *International Conference on Machine Learning*. PMLR, 49890–49920.
- Wei et al. (2025) Anjiang Wei, Huanmi Tan, Tarun Suresh, Daniel Mendoza, Thiago S.F.X. Teixeira, Ke Wang, Caroline Trippel, and Alex Aiken. 2025. VeriCoder: Enhancing LLM-Based RTL Code Generation Through Functional Correctness Validation. In *arXiv preprint arXiv:2504.15659*.
- Yang et al. (2025) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. *arXiv preprint arXiv:2505.09388* (2025).
- Ye et al. (2025) Yufan Ye, Ting Zhang, Wenbin Jiang, and Hua Huang. 2025. Process-Supervised Reinforcement Learning for Code Generation. In *EMNLP*.
- Yubeaton et al. (2025) Patrick Yubeaton, Andre Nakkab, Weihua Xiao, Luca Collini, Ramesh Karri, Chinmay Hegde, and Siddharth Garg. 2025. Verithoughts: Enabling automated verilog code generation using reasoning and formal verification. *arXiv preprint arXiv:2505.20302* (2025).
- Zhang et al. (2024) Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. 2024. RAFT: Adapting Language Model to Domain Specific RAG. *CoRR* (2024).
- Zhang et al. (2025) Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. *arXiv preprint arXiv:2506.05176* (2025).
- Zhao et al. (2024) Yang Zhao, Di Huang, Chongxiao Li, Pengwei Jin, Ziyuan Nan, Tianyun Ma, Lei Qi, Yansong Pan, Zhenxing Zhang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu, and Yunji Chen. 2024. CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization. arXiv:2407.10424 [cs.PL]