Constrained Code Generation with Discrete Diffusion
Summary
This paper introduces Constrained Diffusion for Code (CDC), a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process of discrete diffusion models for code generation. CDC consistently improves constraint satisfaction in functional correctness, security, and syntax across benchmarks, outperforming existing diffusion and autoregressive baselines.
# Constrained Code Generation with Discrete Diffusion
Source: [https://arxiv.org/html/2605.16829](https://arxiv.org/html/2605.16829)
Lize Shao \(University of Virginia, zgr3et@virginia\.edu\), Michael Cardei \(University of Virginia, ntr2rm@virginia\.edu\), Zichen Xie \(University of Virginia, graysonxie@virginia\.edu\), Ferdinando Fioretto \(University of Virginia, fioretto@virginia\.edu\), Wenxi Wang \(University of Virginia, wenxiw@virginia\.edu\)
###### Abstract
Discrete diffusion models are a powerful, emerging paradigm for code generation\. They construct programs through iterative refinement of partially corrupted token sequences and enable parallel token refinement\. Importantly, this paradigm exposes a global program state at each denoising step, which provides a natural intervention point for enforcing program\-level functionality and security constraints, guiding the generation before the final code is committed\. Building on this observation, the paper introduces Constrained Diffusion for Code \(CDC\), a training\-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process\. CDC augments the base discrete diffusion sampler with constraint\-aware denoising operators that combine mathematical optimization with program analysis to identify constraint\-relevant regions of the intermediate program state and locally adjust the denoising trajectory, steering generation toward feasible programs while remaining close to the base model\. Across code generation benchmarks, CDC consistently improves constraint satisfaction in functional correctness, security, and even syntax, outperforming discrete diffusion and autoregressive baselines with less corrective computation and more localized edits\.
Figure 1: CDC vs\. other diffusion code baselines on HumanEval\-X \(HE\-X\), MBPP, CWEval, and LLMSecEval\+\.
## 1 Introduction
Code generation has become an increasingly important capability in modern generative modeling systems, with notable applications in program synthesis, developer assistance, and agentic software workflows\[[16](https://arxiv.org/html/2605.16829#bib.bib23),[33](https://arxiv.org/html/2605.16829#bib.bib5),[31](https://arxiv.org/html/2605.16829#bib.bib6)\]\. In this setting, discrete diffusion language models are emerging as a promising paradigm, constructing programs through iterative refinement of partially corrupted token sequences\[[10](https://arxiv.org/html/2605.16829#bib.bib26),[29](https://arxiv.org/html/2605.16829#bib.bib24)\]\. However, the objective used to train generative models does not by itself capture the requirements that make generated code correct or secure\. These models are trained to match a target distribution and produce high\-fidelity samples, but a program may be highly plausible under the base model while still producing incorrect behavior or containing an exploitable vulnerability\. In practice, generated code must satisfy explicit program\-level constraints, including syntactic validity and, more importantly, functional correctness and security properties\. Since programs are executable objects, violations of these constraints can lead directly to runtime errors, incorrect outputs, or security failures\[[21](https://arxiv.org/html/2605.16829#bib.bib7),[3](https://arxiv.org/html/2605.16829#bib.bib8)\]\. Incorporating such constraints is therefore critical for both code quality and safety\.
These requirements expose a central challenge for code generation: *Program\-level constraints often depend on global structure or semantic properties that are not visible from any single local token decision*\. More broadly, common guidance mechanisms such as prompting, reranking, rejection sampling, or post\-hoc repair typically intervene only after candidate programs have already been generated\[[27](https://arxiv.org/html/2605.16829#bib.bib9),[8](https://arxiv.org/html/2605.16829#bib.bib10)\]\. In code generation, this is particularly limiting because the underlying search space is highly combinatorial, while the subset of programs satisfying semantic or security constraints may constitute only a small feasible region under the model distribution\. Thus, post\-hoc approaches can be both computationally inefficient and unreliable, especially when constraint satisfaction requires steering generation away from the model’s dominant unconstrained modes and toward rarer but feasible solutions\. In contrast, as this paper shows, discrete diffusion models offer a different opportunity: because they refine the entire sequence through intermediate denoising states, they expose a global program representation throughout sampling\. This structure creates natural intervention points for enforcing constraints on the code during the denoising process\.
Building on this opportunity, this paper introduces Constrained Diffusion for Code \(CDC\), a training\-free neurosymbolic framework that integrates program\-level constraints \(i\.e\., functionality and security\) directly into the reverse denoising process of discrete diffusion models\. CDC treats diffusion sampling as an editable program trajectory: at each denoising step, the model proposes a full clean\-state distribution over the program; CDC decodes this proposal, evaluates the resulting candidate against program\-level constraints, localizes feedback to relevant regions, and applies a constraint\-aware correction before sampling continues\. Each correction follows a localized constrained\-sampling principle: reduce constraint violation while remaining close to the base denoiser, revising constraint\-relevant regions without unnecessarily perturbing the rest of the program\. The framework supports both soft and hard feedback\. For functional correctness, CDC instantiates this principle as GradGuide, which uses soft surrogate execution signals to localize likely errors and guide targeted denoising updates\. For security, CDC instantiates it as MDFI, which uses static program analysis to identify vulnerable regions and correct them through localized remasking, insertion, and feedback injection\. In this way, CDC exploits the global, editable states exposed by diffusion sampling to perform focused constraint correction during generation, rather than relying on post\-hoc filtering or whole\-program regeneration after failure\.
This work makes three main contributions:
1. Constraint\-aware localization\. It introduces a general localization mechanism for constrained diffusion code generation, mapping program\-level feedback from surrogates or static analyzers to the token regions that should be revised during denoising\.
2. Constraint\-aware correction\. It proposes training\-free correction operators that revise these localized regions during reverse diffusion while staying close to the base denoiser\. The operators support differentiable surrogate guidance through gradient and KL\-anchored updates, and non\-differentiable symbolic analyzer feedback through targeted remasking, mask insertion, and feedback injection\.
3. Empirical evaluation\. It provides empirical evidence that denoising\-time constraint integration improves constrained code generation across functional\-correctness and security\-oriented benchmarks\. As summarized in Figure 1, CDC improves functionality from 34\.1% to 65\.2% on HumanEval\-X C\+\+ and from 27\.7% to 59\.2% on MBPP\-C\+\+ over Dream\-Coder 7B, while MDFI improves CWEval joint functionality\-security success from 12\.04% to 34\.26% over the state\-of\-the\-art diffusion baseline with far fewer edited tokens\. Although CDC targets functionality and security, it also improves syntactic correctness from 67\.1% to 79\.2% on HumanEval\-X C\+\+ and from 41\.1% to 72\.0% on MBPP\-C\+\+\.
## 2 Related Work
Discrete diffusion language models generate sequences through iterative denoising rather than left\-to\-right decoding, exposing partially generated global states throughout sampling\. Recent code\-oriented diffusion models including DiffuCoder\[[10](https://arxiv.org/html/2605.16829#bib.bib26)\] and Dream\-Coder 7B\[[29](https://arxiv.org/html/2605.16829#bib.bib24)\] show that this paradigm can achieve competitive code\-generation performance while supporting flexible, non\-autoregressive refinement\. CDC builds on this property: rather than using diffusion only as an alternative decoder, it treats intermediate denoising states as editable program representations where constraints can be evaluated, localized, and incorporated before generation is complete\.
Controllable code generation has been studied primarily in autoregressive models through constrained decoding, reranking, rejection sampling, verification, and iterative repair\[[14](https://arxiv.org/html/2605.16829#bib.bib12),[19](https://arxiv.org/html/2605.16829#bib.bib13),[8](https://arxiv.org/html/2605.16829#bib.bib10),[18](https://arxiv.org/html/2605.16829#bib.bib15),[4](https://arxiv.org/html/2605.16829#bib.bib14)\]\. These methods can improve syntactic validity, type consistency, functional correctness, or security, but they typically operate either at the next\-token level or after a complete candidate has been generated\. This is limiting for program\-level constraints: semantic and security violations often depend on global dataflow or execution behavior, and repairing them post hoc may require regenerating large suffixes or entire programs\. CDC instead injects feedback during denoising, when the full program state is visible and localized regions can still be revised\.
Recent work has explored controllable generation for discrete diffusion models through classifier guidance, classifier\-free guidance, hidden\-state optimization, and projection\-based constrained sampling\[[26](https://arxiv.org/html/2605.16829#bib.bib16),[11](https://arxiv.org/html/2605.16829#bib.bib17),[6](https://arxiv.org/html/2605.16829#bib.bib4),[9](https://arxiv.org/html/2605.16829#bib.bib18)\]\. Closest to our setting, Mündler et al\.\[[17](https://arxiv.org/html/2605.16829#bib.bib28)\] enforce context\-free grammar constraints during diffusion\-LLM decoding, improving syntactic correctness for code\. CDC targets a broader class of program\-level constraints, including functional and security requirements that cannot generally be expressed as grammar membership or local token restrictions\. Rather than applying global guidance or syntax\-only filtering, CDC localizes surrogate or analyzer feedback to constraint\-relevant code regions and revises those regions during the reverse diffusion trajectory\.
## 3 Problem Setting
Code generation with target specifications\. Let $\bm{x}_0 = (x_0^1, \dots, x_0^L)$ denote a length\-$L$ sequence of discrete tokens, where each token takes values in a vocabulary $\mathcal{V}$\. In code generation, $\mathcal{V}$ may include programming\-language keywords, identifiers, literals, punctuation, operators, and special formatting tokens \(e\.g\., def, if, =, \(, \), :, and variable names\)\. A code generation model defines a conditional distribution over token sequences given a context $c$, where the functionality or security requirements may be specified through a problem description, function signature, partial program, unit tests, or other task\-specific information\. We refer to these requirements as the *target specification*; Figure [2](https://arxiv.org/html/2605.16829#S5.F2) shows an example\. The goal is therefore to generate a sequence $\bm{x}_0$ that is both plausible under the learned code distribution and valid with respect to the target specification\.
Constrained sampling with hard and soft constraints\. Let $p_\theta(\bm{x} \mid c)$ denote a base code generation model\. Unconstrained decoding samples likely programs from $p_\theta$; constrained decoding instead defines a sampler $p_\theta^{\mathcal{C}}(\bm{x} \mid c)$ that remains close to the base model while steering generation toward the feasible set $\mathcal{C}(c)$ specified by the task\. We distinguish two kinds of constraints\. *Hard constraints* are feasibility conditions that must hold exactly, such as parsing, compiling, passing required tests, or satisfying security checks\. *Soft constraints* provide graded feedback before exact feasibility is reached, such as surrogate correctness scores, partial test\-pass rates, or analyzer severity scores\. CDC uses both: hard constraints define the target feasible set, while soft constraints provide useful denoising\-time guidance toward it\.
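To make the hard/soft distinction concrete, the following is a minimal sketch using unit\-test outcomes as the constraint signal\. The function names and the sample outcomes are illustrative assumptions, not part of CDC\.

```python
from typing import Sequence

def hard_feasible(test_results: Sequence[bool]) -> bool:
    """Hard constraint: feasible only if every required test passes."""
    return all(test_results)

def soft_score(test_results: Sequence[bool]) -> float:
    """Soft constraint: graded feedback as the fraction of tests passed."""
    return sum(test_results) / len(test_results)

# Hypothetical test outcomes for one candidate program.
results = [True, True, False, True]
print(hard_feasible(results), soft_score(results))  # False 0.75
```

The candidate is infeasible under the hard constraint, yet the soft score still indicates how close it is, which is the kind of signal that can guide sampling before feasibility is reached\.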
## 4 Preliminaries: Masked Diffusion Models
Masked diffusion language models generate a sequence by reversing a discrete noising process\[[32](https://arxiv.org/html/2605.16829#bib.bib25),[25](https://arxiv.org/html/2605.16829#bib.bib2),[10](https://arxiv.org/html/2605.16829#bib.bib26)\]\. Starting from a clean token sequence $\bm{x}_0$, the forward process gradually corrupts tokens into an absorbing mask token \[MASK\]; once a token is masked, it remains masked at later timesteps\[[25](https://arxiv.org/html/2605.16829#bib.bib2),[20](https://arxiv.org/html/2605.16829#bib.bib27)\]\. For each position $i \in [L]$, the forward transition is

$$q(x_t^i \mid x_{t-1}^i) = \mathrm{Cat}\!\left(x_t^i;\; \alpha_{t\mid t-1}\, x_{t-1}^i + \bigl(1-\alpha_{t\mid t-1}\bigr)\,{\tt\bm{M}}\right), \tag{1}$$

where $\mathrm{Cat}(\cdot;\pi)$ is a categorical distribution, ${\tt\bm{M}}$ is the one\-hot representation of \[MASK\], and $\alpha_{t\mid t-1} := \alpha_t/\alpha_{t-1}$ for a decreasing noise schedule with $\alpha_0 = 1$ and $\alpha_T = 0$\. Thus, an unmasked token is preserved with probability $\alpha_{t\mid t-1}$ and masked otherwise\.
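The forward process in Eq\. 1 can be sketched in a few lines\. This is a minimal illustration, assuming a linear schedule $\alpha_t = 1 - t/T$ \(the text only requires a decreasing schedule with $\alpha_0 = 1$ and $\alpha_T = 0$\); the helper names `alpha`, `forward_step`, and `forward_process` are hypothetical\.

```python
import random
from typing import List

MASK = "[MASK]"

def alpha(t: int, T: int) -> float:
    """Assumed linear schedule with alpha_0 = 1 and alpha_T = 0."""
    return 1.0 - t / T

def forward_step(tokens: List[str], t: int, T: int, rng: random.Random) -> List[str]:
    """Apply q(x_t | x_{t-1}) independently at every position."""
    keep_prob = alpha(t, T) / alpha(t - 1, T)  # alpha_{t|t-1}
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append(MASK)  # the mask state is absorbing
        else:
            out.append(tok if rng.random() < keep_prob else MASK)
    return out

def forward_process(tokens: List[str], T: int, seed: int = 0) -> List[List[str]]:
    """Return the states x_0, x_1, ..., x_T of one forward trajectory."""
    rng = random.Random(seed)
    states = [list(tokens)]
    for t in range(1, T + 1):
        states.append(forward_step(states[-1], t, T, rng))
    return states

states = forward_process(["def", "f", "(", ")", ":"], T=8)
for t, state in enumerate(states):
    print(t, state)
```

Because $\alpha_T = 0$, the keep probability at the last step is zero, so the final state is fully masked regardless of the random seed\.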
The reverse process uses a neural denoiser to fill in masked positions\. Given a partially masked sequence $\bm{x}_t$ at timestep $t$, the denoiser predicts a clean\-token distribution $\hat{\bm{x}}_0^{(t)} = \bm{x}_\theta(\bm{x}_t, t)$\. The reverse transition is

$$p_\theta(x_{t-1}^i \mid \bm{x}_t) = \begin{cases} \mathrm{Cat}(x_{t-1}^i;\, x_t^i), & \text{if } x_t^i \neq \texttt{[MASK]}, \\[8pt] \mathrm{Cat}\!\left(x_{t-1}^i;\; \gamma_t\,{\tt\bm{M}} + \eta_t\,\bm{x}_\theta^i(\bm{x}_t, t)\right), & \text{if } x_t^i = \texttt{[MASK]}, \end{cases} \tag{2}$$

for each position $i \in [L]$ and timestep $t = T, \dots, 1$, with $\gamma_t = \frac{1-\alpha_{t-1}}{1-\alpha_t}$ and $\eta_t = \frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t}$ balancing mask retention and sampling from the denoiser’s clean\-token prediction\.
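As a quick consistency check on Eq\. 2, the two coefficients always sum to one, so the masked\-branch parameter is a valid probability vector\. The second part of the worked example below additionally assumes a linear schedule $\alpha_t = 1 - t/T$, which is an illustrative choice and not one stated in the text\.

```latex
\gamma_t + \eta_t
  = \frac{(1-\alpha_{t-1}) + (\alpha_{t-1}-\alpha_t)}{1-\alpha_t}
  = \frac{1-\alpha_t}{1-\alpha_t} = 1,
\qquad
\alpha_t = 1 - \tfrac{t}{T}
  \;\Longrightarrow\;
  \gamma_t = \frac{(t-1)/T}{t/T} = \frac{t-1}{t},
  \quad
  \eta_t = \frac{1/T}{t/T} = \frac{1}{t}.
```

Under this assumed schedule, a masked position is filled at step $t$ with probability $1/t$, and at $t = 1$ every remaining mask is filled because $\gamma_1 = 0$\.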
At inference time, sampling starts from the fully masked sequence $\bm{x}_T$ and repeatedly applies Eq\. [2](https://arxiv.org/html/2605.16829#S4.E2) until it obtains the generated sequence $\bm{x}_0$\.
## 5 Constrained Diffusion for Code Generation
Figure 2: Overview of CDC\.

The formulation above motivates *Constrained Diffusion for Code* \(CDC\): at each timestep, the denoiser proposes a full clean program distribution $\hat{\bm{x}}_0^{(t)}$, creating a natural point to evaluate, localize, and incorporate program\-level constraints during reverse denoising\.
### 5\.1 CDC Framework
CDC is training\-free and treats reverse diffusion as an editable trajectory that integrates base denoising with constraint\-aware interventions\. Starting from a fully masked program $\bm{x}_T = (\texttt{[MASK]}, \dots, \texttt{[MASK]})$, sampling proceeds backward through timesteps $t = T, \dots, 1$ until a clean program $\bm{x}_0$ is produced\. An overview of CDC is shown in Figure [2](https://arxiv.org/html/2605.16829#S5.F2)\. At each timestep, CDC performs four steps:
Step 1: Clean\-state proposal\. Given the current partially masked sequence $\bm{x}_t$, the base denoiser predicts a per\-position categorical distribution over clean tokens,

$$\hat{\bm{x}}_0^{(t)} = \bm{x}_\theta(\bm{x}_t, t) \in \Delta^{L \times |\mathcal{V}|}. \tag{3}$$

This distribution represents the model’s unconstrained clean\-state proposal at timestep $t$, which can be decoded into an intermediate program candidate for constraint evaluation, for example by taking the per\-position maximum,

$$\bar{x}_0^{(t),i} = \arg\max_{v \in \mathcal{V}} \hat{x}_0^{(t),i}(v), \qquad i \in [L]. \tag{4}$$
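The argmax decoding in Eq\. 4 is simple to illustrate\. The sketch below assumes the proposal is stored as a list of per\-position probability rows over a small hypothetical vocabulary; `decode_argmax` is an illustrative name\.

```python
from typing import List

def decode_argmax(proposal: List[List[float]], vocab: List[str]) -> List[str]:
    """Pick the most probable token at each position (first index wins ties)."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in proposal]

vocab = ["return", "x", "+", "1"]          # hypothetical 4-token vocabulary
proposal = [
    [0.7, 0.1, 0.1, 0.1],                  # position 0 favors "return"
    [0.1, 0.6, 0.2, 0.1],                  # position 1 favors "x"
    [0.2, 0.2, 0.5, 0.1],                  # position 2 favors "+"
]
print(decode_argmax(proposal, vocab))      # ['return', 'x', '+']
```

The decoded candidate is only used for constraint evaluation; the full distribution is what the later correction step modifies\.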
Step 2: Constraint\-aware localization\. A constraint evaluator $\mathcal{E}$ then evaluates this candidate $\bar{\bm{x}}_0^{(t)}$ under the task context $c$ and constraint specification $\mathcal{C}$\. The evaluator $\mathcal{E}$ may be instantiated by any constraint\-scoring mechanism, such as a symbolic program analyzer, a verifier, or a learned surrogate model\. The evaluator returns violation scores $\boldsymbol{\nu}_t$ and structured feedback $r_t$ \(e\.g\., a vulnerability report\), which are mapped by a constraint\-localization map $\mathcal{M}$ to editable regions $\mathcal{S}_t$:

$$(\boldsymbol{\nu}_t, r_t) = \mathcal{E}(\bar{\bm{x}}_0^{(t)}, c;\; \mathcal{C}), \qquad \mathcal{S}_t = \mathcal{M}(\bar{\bm{x}}_0^{(t)}, c, r_t;\; \mathcal{C}) \subseteq [L]. \tag{5}$$

The map $\mathcal{M}$ is induced by the form of feedback returned by $\mathcal{E}$, and translates evaluator feedback into token positions or spans that are responsible for the constraint violation\. The set $\mathcal{S}_t$ identifies where the denoising trajectory should be revised, with specific instantiations given in Section [5\.2](https://arxiv.org/html/2605.16829#S5.SS2)\. This separates the question of *where* to edit from the question of *how* to modify the denoising trajectory, which is the focus of the next step\.
Step 3: Constraint\-aware correction\. Given the localized edit region $\mathcal{S}_t$ and feedback $r_t$, CDC applies a constraint\-aware denoising operator $\mathcal{P}_{\mathcal{C}}$ to produce a corrected clean\-state proposal:

$$\bm{y}_t = \mathcal{P}_{\mathcal{C}}(\hat{\bm{x}}_0^{(t)}, \mathcal{S}_t, r_t, c). \tag{6}$$

The operator is designed to reduce constraint violation while keeping $\bm{y}_t$ close to the base denoiser output $\hat{\bm{x}}_0^{(t)}$\. CDC represents constraint feedback through violation penalties that may come from differentiable surrogate scores, verifier outputs, or security diagnostics\. For constraint $j$ at timestep $t$, let $\nu_{t,j}(\bm{y}; r_t, c, \mathcal{S}_t)$ denote the feedback\-induced penalty under candidate proposal $\bm{y}$, and let $\lambda_j \geq 0$ denote its penalty coefficient\. The resulting localized trust\-region update is:

$$\bm{y}_t = \arg\min_{\bm{y} \in \Delta^{L \times |\mathcal{V}|}} \Bigl\{ D_{\mathrm{KL}}\!\left(\bm{y} \,\|\, \hat{\bm{x}}_0^{(t)}\right) + \underbrace{\sum_{j=1}^{m} \lambda_j\, \nu_{t,j}(\bm{y}; r_t, c, \mathcal{S}_t)}_{V_t(\bm{y}; r_t, c, \mathcal{S}_t)} \Bigr\}. \tag{7}$$

Here, $D_{\mathrm{KL}}$ denotes Kullback–Leibler divergence, which anchors the correction to the base proposal, and $V_t$ is the aggregate violation penalty\. The localization set $\mathcal{S}_t$ scopes the correction: some operators enforce locality by anchoring positions outside $\mathcal{S}_t$, while others use $\mathcal{S}_t$ as a remasking target or feedback anchor\. Section [5\.2](https://arxiv.org/html/2605.16829#S5.SS2) instantiates this interface for differentiable surrogate feedback for program functionality, and symbolic program\-analysis feedback for program security\.
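For intuition about Eq\. 7, consider the special case where the penalty is linear in the proposal, $V_t(\bm{y}) = \lambda \sum_{i \in \mathcal{S}_t} \sum_v y^i(v)\, c^i(v)$ for some per\-token cost $c^i(v)$\. In that case the KL\-anchored minimizer has the closed form $y^i(v) \propto \hat{x}_0^{(t),i}(v)\, e^{-\lambda c^i(v)}$ inside $\mathcal{S}_t$ and equals the base proposal elsewhere\. The sketch below implements only this assumed linear case; the cost table and the name `kl_anchored_update` are hypothetical, and CDC’s actual operators may solve Eq\. 7 differently\.

```python
import math
from typing import List, Set

def kl_anchored_update(
    proposal: List[List[float]],   # base proposal, one probability row per position
    costs: List[List[float]],      # assumed per-token violation costs c^i(v)
    region: Set[int],              # localized edit region S_t
    lam: float,                    # penalty coefficient lambda >= 0
) -> List[List[float]]:
    """Exponentially tilt rows inside the region; anchor rows outside it."""
    corrected = []
    for i, row in enumerate(proposal):
        if i not in region:
            corrected.append(list(row))          # unchanged outside S_t
            continue
        weights = [p * math.exp(-lam * c) for p, c in zip(row, costs[i])]
        total = sum(weights)
        corrected.append([w / total for w in weights])
    return corrected

proposal = [[0.5, 0.5], [0.9, 0.1]]
costs = [[0.0, 0.0], [1.0, 0.0]]    # token 0 at position 1 is penalized
y = kl_anchored_update(proposal, costs, region={1}, lam=2.0)
print(y)
```

Larger $\lambda$ moves more probability mass away from penalized tokens, while $\lambda = 0$ recovers the base proposal, which mirrors the trade\-off between violation reduction and closeness to the denoiser\.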
Step 4: Constrained reverse update\. Finally, CDC advances the reverse chain using the corrected proposal $\bm{y}_t$ through a constraint\-aware reverse kernel,

$$\bm{x}_{t-1} \sim p_\theta^{\mathcal{C}}(\cdot \mid \bm{x}_t, \bm{y}_t). \tag{8}$$

A representative transition replaces the base clean\-token distribution prediction with the corrected proposal:

$$p_\theta^{\mathcal{C}}(x_{t-1}^i \mid \bm{x}_t, \bm{y}_t) = \begin{cases} \mathrm{Cat}(x_{t-1}^i;\, x_t^i), & \text{if } x_t^i \neq \texttt{[MASK]}, \\[8pt] \mathrm{Cat}\!\left(x_{t-1}^i;\; \gamma_t\,{\tt\bm{M}} + \eta_t\,\bm{y}_t^i\right), & \text{if } x_t^i = \texttt{[MASK]}. \end{cases} \tag{9}$$

Here, $\bm{y}_t^i$ is the corrected clean\-token distribution at position $i$\. Depending on the operator, positions outside $\mathcal{S}_t$ may either remain anchored to the base proposal or receive weaker global adjustments, while positions inside $\mathcal{S}_t$ receive the strongest constraint\-aware corrections through modified token distributions, targeted remasking, or feedback\-conditioned revision, as described in Section [5\.2](https://arxiv.org/html/2605.16829#S5.SS2)\. The resulting reverse trajectory is:

$$\bm{x}_t \xrightarrow{\;\bm{x}_\theta(\cdot,t)\;} \hat{\bm{x}}_0^{(t)} \xrightarrow{\;\mathrm{Decode}\;} \bar{\bm{x}}_0^{(t)} \xrightarrow{\;\mathcal{E},\,\mathcal{M}\;} (\boldsymbol{\nu}_t, r_t, \mathcal{S}_t) \xrightarrow{\;\mathcal{P}_{\mathcal{C}}\;} \bm{y}_t \xrightarrow{\;p_\theta^{\mathcal{C}}\;} \bm{x}_{t-1}.$$
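The four steps can be put together in a toy sampling loop\. Everything below is a sketch under strong assumptions: the denoiser is a hard\-coded stand\-in, the "constraint" is simply that a banned token must not appear, the correction zeroes that token inside the localized region, and the schedule is linear so that $\gamma_t = (t-1)/t$\. None of these components are the ones used in the paper\.

```python
import random

MASK = "[MASK]"
VOCAB = ["gets", "fgets", "(", ")"]
BANNED = "gets"  # toy constraint: this token must not appear

def toy_denoiser(x_t):
    """Hypothetical stand-in for x_theta; prefers the banned token at position 0."""
    prefs = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    return [prefs[i] if tok == MASK else [float(v == tok) for v in VOCAB]
            for i, tok in enumerate(x_t)]

def decode(proposal):
    """Step 1 decoding (Eq. 4): per-position argmax."""
    return [VOCAB[max(range(len(VOCAB)), key=row.__getitem__)] for row in proposal]

def evaluate_and_localize(candidate):
    """Step 2 (toy E and M): violation count and offending positions."""
    region = {i for i, tok in enumerate(candidate) if tok == BANNED}
    return len(region), region

def correct(proposal, region):
    """Step 3 (toy P_C): zero the banned token inside the region, renormalize."""
    banned = VOCAB.index(BANNED)
    corrected = []
    for i, row in enumerate(proposal):
        row = list(row)
        if i in region and row[banned] < 1.0:
            row[banned] = 0.0
            total = sum(row)
            row = [p / total for p in row]
        corrected.append(row)
    return corrected

def reverse_step(x_t, y_t, t, rng):
    """Step 4 (Eq. 9) with an assumed linear schedule: gamma_t = (t - 1) / t."""
    gamma = (t - 1) / t
    out = []
    for tok, row in zip(x_t, y_t):
        if tok != MASK:
            out.append(tok)                      # committed tokens are kept
        elif rng.random() < gamma:
            out.append(MASK)                     # stay masked
        else:
            out.append(rng.choices(VOCAB, weights=row)[0])
    return out

def cdc_sample(T=6, seed=0):
    rng = random.Random(seed)
    x_t = [MASK] * 3
    for t in range(T, 0, -1):
        proposal = toy_denoiser(x_t)
        score, region = evaluate_and_localize(decode(proposal))
        y_t = correct(proposal, region) if score > 0 else proposal
        x_t = reverse_step(x_t, y_t, t, rng)
    return x_t

print(cdc_sample())  # ['fgets', '(', ')']
```

Although the toy denoiser prefers the banned token, the correction is applied before any token is committed, so the sampled program never contains it\. This is the sense in which constraints are enforced during denoising rather than after generation\.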
### 5\.2 Operator Instantiations
We now show how CDC becomes concrete by instantiating the three components of the framework: evaluator $\mathcal{E}$, localization map $\mathcal{M}$, and correction operator $\mathcal{P}_{\mathcal{C}}$\. Following the problem setting \(Section [3](https://arxiv.org/html/2605.16829#S3)\), hard constraints define feasibility, while soft constraints provide graded signals that can guide the sampler before feasibility is reached\. Our first instantiation, GradGuide, uses this soft\-feedback route for program functionality: a learned surrogate relaxes hard execution outcomes into soft scores, localizes likely functional errors with gradients, and corrects the localized region through KL\-anchored proposal updates and remasking\. Our second instantiation, Mid\-Diffusion Feedback Injection \(MDFI\), uses the hard\-feedback route for program security: a static analyzer reports concrete vulnerability witnesses on partial programs, localizes them on a program graph, and corrects the corresponding regions through targeted remasking and feedback injection\.
Surrogate\-gradient operator: GradGuide\. Functional correctness is ultimately a hard constraint: the final program must pass the required tests\. During denoising, however, this hard constraint is not differentiable, so GradGuide instantiates the evaluator $\mathcal{E}^{\mathrm{GG}}$ with an auxiliary surrogate $g_\phi$ trained ahead of time with execution\-driven labels \(e\.g\., test\-pass outcomes\), while keeping the diffusion model parameters $\theta$ fixed\. Given the clean\-state proposal $\hat{\bm{x}}_0^{(t)}$, GradGuide maps each predicted token distribution to an expected embedding,

$$\mathrm{Emb}(\hat{\bm{x}}_0^{(t)})^i = \sum_{v \in \mathcal{V}} \hat{x}_0^{(t),i}(v)\, \mathbf{e}_v = \hat{x}_0^{(t),i}\, \mathbf{E}_{\mathrm{tok}}, \qquad i \in [L], \tag{10}$$

where $\mathbf{E}_{\mathrm{tok}}$ is the token\-embedding matrix and $\mathbf{e}_v$ denotes the embedding of token $v$\.
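Eq\. 10 is a probability\-weighted average of embedding rows, which is what makes the surrogate input differentiable with respect to the proposal\. The following sketch uses plain Python lists and a hypothetical three\-token embedding table; `expected_embedding` and `E_tok` are illustrative names\.

```python
from typing import List

def expected_embedding(
    proposal: List[List[float]],   # L rows of probabilities over the vocabulary
    emb: List[List[float]],        # |V| x d token-embedding matrix
) -> List[List[float]]:
    """Return the L x d matrix whose row i is sum_v proposal[i][v] * emb[v]."""
    dim = len(emb[0])
    return [
        [sum(p * emb[v][k] for v, p in enumerate(row)) for k in range(dim)]
        for row in proposal
    ]

E_tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # hypothetical vocabulary, d = 2
proposal = [[0.5, 0.25, 0.25], [0.0, 0.0, 1.0]]
print(expected_embedding(proposal, E_tok))       # [[0.75, 0.5], [1.0, 1.0]]
```

A one\-hot row reproduces the embedding of that token exactly, so committed positions behave like ordinary embedded tokens\.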
The surrogate predicts a satisfaction score $\nu^{\mathrm{GG}}_{\phi,j}(\mathrm{Emb}(\hat{\bm{x}}_0^{(t)}), c)$ for each constraint $j$, which is converted into the relaxed violation

$$\Delta\nu^{\mathrm{GG}}_j(\hat{\bm{x}}_0^{(t)}, c) = \max\!\left(0,\, \tau_j - \nu^{\mathrm{GG}}_{\phi,j}(\mathrm{Emb}(\hat{\bm{x}}_0^{(t)}), c)\right), \qquad V^{\mathrm{GG}}_t = \sum_{j=1}^{m} \Delta\nu^{\mathrm{GG}}_j. \tag{11}$$

The threshold $\tau_j$ acts as a target score for constraint $j$: the violation is zero once the surrogate score reaches $\tau_j$\. GradGuide localizes edits with one backward pass through $g_\phi$\. The saliency of position $i$ is

$$a_i = \left\| \nabla_{\mathrm{Emb}(\hat{\bm{x}}_0^{(t)})^i} V^{\mathrm{GG}}_t(\hat{\bm{x}}_0^{(t)}, c) \right\|_2, \qquad \mathcal{S}_t^{\mathrm{GG}} = \mathrm{Expand}\!\Bigl(\mathrm{Top}\text{-}k(a_1, \dots, a_L)\Bigr), \tag{12}$$

where $\mathrm{Expand}(\cdot)$ maps selected token positions to syntactically coherent edit spans by enlarging each position to a local syntactic window\. Thus, $\mathcal{S}_t^{\mathrm{GG}}$ contains the syntactically expanded token spans that most influence the surrogate\-predicted violation\.
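The localization in Eqs\. 11 and 12 can be sketched without an autodiff library by assuming a linear surrogate, $\nu = \sum_i \mathbf{w}_i \cdot \mathbf{e}_i$, whose gradient is available analytically\. This is an illustrative simplification: GradGuide uses a learned surrogate $g_\phi$ and a backward pass, and its $\mathrm{Expand}$ uses syntactic windows, whereas the sketch below uses a fixed window of one token on each side\. All names are hypothetical\.

```python
import math
from typing import List

def hinge_violation(score: float, tau: float) -> float:
    """Relaxed violation: zero once the score reaches the threshold tau."""
    return max(0.0, tau - score)

def saliency_linear(embeddings: List[List[float]],
                    weights: List[List[float]], tau: float) -> List[float]:
    """Per-position gradient norms of the hinge for an assumed linear surrogate."""
    score = sum(w * e for w_i, e_i in zip(weights, embeddings)
                for w, e in zip(w_i, e_i))
    if hinge_violation(score, tau) == 0.0:
        return [0.0] * len(embeddings)           # hinge inactive: zero gradient
    # dV/de_i = -w_i when the hinge is active, so the norm is ||w_i||_2.
    return [math.sqrt(sum(w * w for w in w_i)) for w_i in weights]

def localize(saliency: List[float], k: int, window: int = 1) -> List[int]:
    """Top-k positions by saliency, each expanded by a fixed window (assumed)."""
    top = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)[:k]
    region = set()
    for i in top:
        for j in range(i - window, i + window + 1):
            if 0 <= j < len(saliency):
                region.add(j)
    return sorted(region)

a = saliency_linear([[1.0, 0.0], [0.0, 1.0]], [[3.0, 4.0], [0.0, 1.0]], tau=10.0)
print(a)                                                          # [5.0, 1.0]
print(localize([0.1, 0.9, 0.2, 0.05, 0.0, 0.0, 0.8, 0.0], k=2))   # [0, 1, 2, 5, 6, 7]
```

When the surrogate score already meets the threshold, all saliencies are zero and no region needs to be edited, which matches the hinge form of Eq\. 11\.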
Given $\mathcal{S}_t^{\mathrm{GG}}$, GradGuide computes a corrected proposal by approximately minimizing Eq\. [7](https://arxiv.org/html/2605.16829#S5.E7) with $V_t$ replaced by $V^{\mathrm{GG}}_t$\.
The resulting $\bm{y}_t$ is used in the constrained reverse kernel\. When the intermediate program is sufficiently decoded but still violates the hard functional constraint, GradGuide also reopens the localized region by setting $x_t^i = \texttt{[MASK]}$ for $i \in \mathcal{S}_t^{\mathrm{GG}}$, allowing the remaining denoising steps to regenerate those positions under the corrected proposal\. Additional details, including surrogate training and other implementation details, are provided in Appendix [C](https://arxiv.org/html/2605.16829#A3)\.
Program\-analysis\-guided operator: MDFI\. Security requirements are hard constraints: a program is feasible only if the relevant vulnerability checks pass\. For these constraints, we use a static program analyzer to instantiate the evaluator $\mathcal{E}$, which is applied to the intermediate proposal $\bar{\bm{x}}_0^{(t)}$ at selected denoising checkpoints\. The analyzer constructs a partial program graph $\mathcal{G}_t$ from the partial program structure and returns vulnerability witnesses

$$w_k = (n_k, \tau_k, h_k), \tag{13}$$

where $n_k$ is the offending graph node, $\tau_k \in \{\mathrm{sub}, \mathrm{ins}\}$ indicates whether the repair is substitution\- or insertion\-like, and $h_k$ is a remediation hint\. The witness set $\mathcal{W}_t = \{w_k\}_{k=1}^{K_t}$ defines the structured feedback $r_t^{\mathrm{MDFI}}$ and the corresponding hard violation vector $\boldsymbol{\nu}_t^{\mathrm{MDFI}}$\.
MDFI localizes each witness by taking its analyzer\-supported abstract syntax tree and dataflow neighborhood and projecting it back to token positions:

$$\mathcal{S}_t^{\mathrm{MDFI}} = \mathrm{TopBudget}_B\!\left( \bigcup_{k=1}^{K_t} \mathrm{Tok}\bigl(\mathcal{N}(n_k; \mathcal{G}_t)\bigr) \right) \subseteq [L]. \tag{14}$$

Here, $\mathrm{TopBudget}_B(\cdot)$ selects top token spans under a token budget $B$, prioritizing higher\-confidence analyzer witnesses, while $\mathcal{N}(n_k; \mathcal{G}_t)$ denotes the relevant syntactic or dataflow neighborhood around the witness, and $\mathrm{Tok}(\cdot)$ maps graph nodes back to token spans\. Thus, $\mathcal{S}_t^{\mathrm{MDFI}}$ contains diagnostic\-supported regions\.
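One plausible reading of $\mathrm{TopBudget}_B$ in Eq\. 14 is a greedy selection over witness spans ordered by confidence\. The sketch below assumes exactly that, and also assumes that a span is either taken whole or skipped; the text does not specify these details, and `top_budget` is a hypothetical name\.

```python
from typing import List, Tuple

def top_budget(spans: List[Tuple[float, List[int]]], budget: int) -> List[int]:
    """Greedily keep whole spans, highest confidence first, within a token budget.

    spans: (analyzer confidence, token positions covered by the witness neighborhood)
    """
    selected = set()
    for _, positions in sorted(spans, key=lambda s: s[0], reverse=True):
        new_positions = set(positions) - selected
        if len(selected) + len(new_positions) <= budget:
            selected |= new_positions
    return sorted(selected)

spans = [
    (0.9, [4, 5, 6]),       # high-confidence witness
    (0.5, [0, 1, 2, 3]),    # low-confidence witness, too large for the remaining budget
    (0.7, [6, 7]),          # overlaps the first span at position 6
]
print(top_budget(spans, budget=5))   # [4, 5, 6, 7]
```

Overlapping neighborhoods only consume budget once, which keeps the edit region compact when several witnesses point at the same code\.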
Given $\mathcal{S}_t^{\mathrm{MDFI}}$ and $r_t^{\mathrm{MDFI}}$, MDFI applies three discrete interventions\. Let $\mathcal{S}_t^{\mathrm{sub}}$ and $\mathcal{S}_t^{\mathrm{ins}}$ denote the portions of $\mathcal{S}_t^{\mathrm{MDFI}}$ supported by substitution\-type and insertion\-type witnesses, respectively:

$$\mathcal{S}_t^{a} = \mathcal{S}_t^{\mathrm{MDFI}} \cap \bigcup_{k : \tau_k = a} \mathrm{Tok}\!\left(\mathcal{N}(n_k; \mathcal{G}_t)\right), \qquad a \in \{\mathrm{sub}, \mathrm{ins}\}. \tag{15}$$

For substitution\-type witnesses, MDFI remasks the offending committed region:

$$x_t^{\star,i} = \begin{cases} \texttt{[MASK]}, & i \in \mathcal{S}_t^{\mathrm{sub}}, \\ x_t^i, & i \notin \mathcal{S}_t^{\mathrm{sub}}. \end{cases} \tag{16}$$

For insertion\-type witnesses, MDFI inserts $K$ fresh \[MASK\] tokens near a structural anchor selected from $\mathcal{S}_t^{\mathrm{ins}}$:

$$\bm{x}_t^{\star} \leftarrow \mathrm{Insert}_K\bigl(\bm{x}_t^{\star}, \mathrm{anchor}(w_k; \mathcal{S}_t^{\mathrm{ins}})\bigr), \qquad k : \tau_k = \mathrm{ins}, \tag{17}$$

where $\mathrm{Insert}_K(x, a)$ inserts $K$ \[MASK\] tokens into sequence $x$ at anchor position $a$\. The anchor function selects a syntactically valid insertion point near the insertion\-type witness\.
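The two state edits in Eqs\. 16 and 17 are list operations on the current token sequence\. The sketch below assumes that masks are inserted immediately before the anchor index and uses a hypothetical C call as the example; the token sequence, region, and anchor are illustrative and not taken from the paper\.

```python
from typing import List, Set

MASK = "[MASK]"

def remask(tokens: List[str], sub_region: Set[int]) -> List[str]:
    """Eq. 16: reopen committed positions flagged by substitution-type witnesses."""
    return [MASK if i in sub_region else tok for i, tok in enumerate(tokens)]

def insert_masks(tokens: List[str], anchor: int, k: int) -> List[str]:
    """Eq. 17 (assumed convention): insert k fresh masks before index `anchor`."""
    return tokens[:anchor] + [MASK] * k + tokens[anchor:]

tokens = ["strcpy", "(", "dst", ",", "src", ")", ";"]
state = remask(tokens, {0})                  # reopen the unsafe call name
state = insert_masks(state, anchor=0, k=3)   # make room for, e.g., a bounds check
print(state)
```

Insertion shifts every later index, so this sketch remasks first and inserts afterwards; applying several insertions would require adjusting anchors for earlier shifts\.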
The MDFI update is therefore
\(𝒙t⋆,c⋆,𝒮tMDFI\)=𝒫𝒞MDFI\(𝒙^0\(t\),𝒮tMDFI,rtMDFI,c\)\.\\bigl\(\\bm\{x\}\_\{t\}^\{\\star\},\\,c^\{\\star\},\\,\\mathcal\{S\}\_\{t\}^\{\\mathrm\{MDFI\}\}\\bigr\)=\\mathcal\{P\}\_\{\\mathcal\{C\}\}^\{\\mathrm\{MDFI\}\}\\bigl\(\\hat\{\\bm\{x\}\}\_\{0\}^\{\(t\)\},\\,\\mathcal\{S\}\_\{t\}^\{\\mathrm\{MDFI\}\},\\,r\_\{t\}^\{\\mathrm\{MDFI\}\},\\,c\\bigr\)\.\(18\)Here,𝒫𝒞MDFI\\mathcal\{P\}\_\{\\mathcal\{C\}\}^\{\\mathrm\{MDFI\}\}is the MDFI instantiation of the constraint\-aware projection operator defined in Eq\.[6](https://arxiv.org/html/2605.16829#S5.E6)and operationalized by the objective in Eq\.[7](https://arxiv.org/html/2605.16829#S5.E7)\. The chain continues with𝒙t\\bm\{x\}\_\{t\}replaced by𝒙t⋆\\bm\{x\}\_\{t\}^\{\\star\}andccreplaced byc⋆c^\{\\star\}\. MDFI does not updateθ\\thetaor run an inner re\-denoising loop; it edits the current state/context in place and lets the remaining reverse steps regenerate the affected regions\. If no checkpoint fires or no witness is detected, MDFI reduces to the identity operator\. Additional details are in Appendix[D](https://arxiv.org/html/2605.16829#A4)\.
## 6 Experiments
Table 1: Functional correctness (%) on HumanEval-X C++ and MBPP C++. Parenthesized values report the absolute change from the Vanilla model: green indicates an improvement of ≥10%, yellow an improvement of <10%, and red a regression. Within each model, the best performance is bolded.

| Model | Method | HumanEval-X C++: Syn. (compile) | HumanEval-X C++: Fun. (p@1) | HumanEval-X C++: Fun. (p@10) | MBPP C++: Syn. (compile) | MBPP C++: Fun. (p@1) | MBPP C++: Fun. (p@10) |
|---|---|---|---|---|---|---|---|
| Dream 7B | Vanilla | 40.2 | 10.4 | 51.8 | 60.7 | 25.4 | 59.7 |
| Dream 7B | CFG-CD | **70.7** (+30.5) | 11.0 (+0.6) | 60.4 (+8.6) | 59.4 (-1.3) | 24.4 (-1.0) | 55.9 (-3.8) |
| Dream 7B | CDC (Ours) | 44.5 (+4.3) | **20.7** (+10.3) | **64.6** (+12.8) | **61.2** (+0.5) | **33.8** (+8.4) | **63.7** (+4.0) |
| DiffuCoder 7B | Vanilla | 43.3 | 24.4 | 56.1 | 53.9 | 28.2 | 53.1 |
| DiffuCoder 7B | CFG-CD | 49.4 (+6.1) | 25.0 (+0.6) | 50.0 (-6.1) | 50.6 (-3.3) | 27.0 (-1.2) | 51.1 (-2.0) |
| DiffuCoder 7B | CDC (Ours) | **50.0** (+6.7) | **37.2** (+12.8) | **68.3** (+12.2) | **56.9** (+3.0) | **41.8** (+13.6) | **68.0** (+14.9) |
| Dream-Coder 7B | Vanilla | 67.1 | 34.1 | 55.5 | 41.1 | 27.7 | 54.2 |
| Dream-Coder 7B | CFG-CD | 61.6 (-5.5) | 37.8 (+3.7) | 53.7 (-1.8) | 37.3 (-3.8) | 25.7 (-2.0) | 49.1 (-5.1) |
| Dream-Coder 7B | CDC (Ours) | **79.2** (+12.1) | **65.2** (+31.1) | **83.5** (+28.0) | **72.0** (+30.9) | **59.2** (+31.5) | **74.8** (+20.6) |
| DeepSeek-Coder 6.7B | Vanilla | **94.5** | 59.8 | 80.5 | 88.2 | 59.7 | 73.0 |
| DeepSeek-Coder 6.7B | Reprompt | 90.9 (-3.6) | **64.0** (+4.2) | **82.9** (+2.4) | **89.2** (+1.0) | **60.7** (+1.0) | **74.1** (+1.1) |
| CodeLlama 7B | Vanilla | **90.2** | 30.5 | 54.9 | **83.9** | 46.1 | 61.7 |
| CodeLlama 7B | Reprompt | 81.7 (-8.5) | **31.1** (+0.6) | **55.5** (+0.6) | 83.6 (-0.3) | **47.4** (+1.3) | **64.7** (+3.0) |
This section evaluates CDC for constrained code generation under four objectives: functional correctness, syntactic validity, security, and edit localization\. For functional correctness and syntactic validity, we instantiate CDC with GradGuide\. For security, we instantiate CDC with MDFI\. We evaluate edit localization for both instantiations to measure whether CDC concentrates corrections on constraint\-relevant regions\. Additional experimental details and results are provided in Appendix[B](https://arxiv.org/html/2605.16829#A2)\.
### 6.1 Code Functionality and Syntax
We first evaluate CDC on the core requirements of code generation: producing programs that compile (syntactic validity) and implement the intended behavior (functional correctness).
Baselines. We evaluate CDC on three diffusion language models: Dream-Coder-7B [[29](https://arxiv.org/html/2605.16829#bib.bib24)], DiffuCoder-7B [[10](https://arxiv.org/html/2605.16829#bib.bib26)], and Dream-7B [[32](https://arxiv.org/html/2605.16829#bib.bib25)]. We compare against two diffusion baselines: Vanilla, which uses the base denoising sampler without constraint intervention, and CFG-CD [[17](https://arxiv.org/html/2605.16829#bib.bib28)], which uses grammar-constrained diffusion decoding for syntax-level constraints. We also compare against two autoregressive code models, DeepSeek-Coder-Instruct-6.7B [[12](https://arxiv.org/html/2605.16829#bib.bib33)] and CodeLlama-7B [[24](https://arxiv.org/html/2605.16829#bib.bib1)], under vanilla and reprompt configurations. Reprompt first performs vanilla generation; if the output fails the constraint, test pass/fail feedback is used to launch one additional generation round.

Benchmarks. We evaluate C++ program synthesis on HumanEval-X C++ (164 tasks) and MBPP-C++ (397 tasks) [[7](https://arxiv.org/html/2605.16829#bib.bib20),[2](https://arxiv.org/html/2605.16829#bib.bib22),[34](https://arxiv.org/html/2605.16829#bib.bib21)].

Evaluation Metrics. Functionality is measured by pass@1 and pass@10: a task counts as solved at pass@k if at least one of the k generated samples compiles and passes all tests. Syntax is the compile-success rate.
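As a concrete reading of these metrics, the following Python sketch computes pass@k and a compile-success rate from per-sample outcomes. It is an illustrative sketch only: the `Sample` record, the function names, and the choice to average compile success over all generated samples are assumptions, since the text does not specify how the paper's evaluation harness aggregates results.

```python
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    """Outcome of one generated program (hypothetical record type)."""
    compiled: bool
    passed_all_tests: bool


def pass_at_k(results: Dict[str, List[Sample]], k: int) -> float:
    """Percentage of tasks where any of the first k samples compiles and passes all tests."""
    if not results:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, samples in results.items():
        if len(samples) < k:
            raise ValueError(f"task {task_id} has fewer than {k} samples")
        if any(s.compiled and s.passed_all_tests for s in samples[:k]):
            solved += 1
    return 100.0 * solved / len(results)


def compile_rate(results: Dict[str, List[Sample]]) -> float:
    """Percentage of all generated samples that compile (assumed aggregation)."""
    flat = [s for samples in results.values() for s in samples]
    if not flat:
        raise ValueError("no samples given")
    return 100.0 * sum(s.compiled for s in flat) / len(flat)
```

Under this reading, pass@10 can never be lower than pass@1 on the same samples, because the first sample is always among the first ten.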
Table 2: Security-aware code generation on CWEval and LLMSecEval+. Parenthesized values give the absolute change relative to the vanilla model: green marks gains of ≥10%, yellow gains of <10%, and red regressions. Within each model, the best performance is bolded.

| Model | Method | CWEval func@1 | CWEval sec@1 | CWEval fs@1 | CWEval fs@5 | CWEval fs@10 | LLMSecEval+ func@1 | LLMSecEval+ sec@1 | LLMSecEval+ fs@1 | LLMSecEval+ fs@5 | LLMSecEval+ fs@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Diffusion | Vanilla | 26.9 | 18.5 | 12.0 | 25.9 | 26.9 | 32.0 | 54.7 | 14.7 | 22.0 | 24.0 |
| Diffusion | Sec prompt | 26.9 (+0) | 19.4 (+1) | 13.9 (+2) | 23.2 (-3) | 27.8 (+1) | **34.7** (+3) | 58.7 (+4) | 16.0 (+1) | 22.7 (+1) | 26.7 (+3) |
| Diffusion | CDC | **39.8** (+13) | **41.7** (+23) | **34.3** (+22) | **49.1** (+23) | **51.9** (+25) | 30.7 (-1) | **80.7** (+26) | **24.7** (+10) | **40.7** (+19) | **45.3** (+21) |
| AR | Vanilla | **48.2** | 24.1 | 21.3 | 36.1 | 40.7 | 37.3 | 55.3 | 21.3 | 30.7 | 38.0 |
| AR | Reprompt | 44.4 (-4) | **47.2** (+23) | **39.8** (+19) | **63.0** (+27) | **68.5** (+28) | **40.0** (+3) | **64.7** (+9) | **26.0** (+5) | **40.0** (+9) | **48.7** (+11) |
Table [1](https://arxiv.org/html/2605.16829#S6.T1) reports syntax and functional correctness on HumanEval-X C++ and MBPP-C++. Across diffusion backbones, CDC improves both $\mathrm{pass@1}$ and $\mathrm{pass@10}$ over vanilla diffusion, with the largest gains on stronger code-diffusion models. On HumanEval-X C++, CDC improves Dream-Coder 7B from $34.1\%$ to $65.2\%$ $\mathrm{pass@1}$, while also increasing compile success from $67.1\%$ to $79.2\%$. On MBPP-C++, the same model improves from $27.7\%$ to $59.2\%$ $\mathrm{pass@1}$, with compile success increasing from $41.1\%$ to $72.0\%$. CDC also improves weaker diffusion backbones, increasing Dream 7B from $10.4\%$ to $20.7\%$ on HumanEval-X and DiffuCoder 7B from $24.4\%$ to $37.2\%$. The improvements are consistent at $\mathrm{pass@10}$.
While CFG-CD improves grammar or compile validity in several cases, it does not consistently improve functional correctness. For example, on MBPP-C++, CFG-CD decreases Dream-Coder 7B $\mathrm{pass@1}$ from $27.7\%$ to $25.7\%$, whereas CDC increases it to $59.2\%$. This difference suggests that syntax constraints alone are insufficient for program synthesis: CDC improves both validity and test-pass rates because its surrogate signal is trained from execution outcomes rather than grammar membership alone.
Compared with autoregressive baselines, CDC substantially narrows the functionality gap while retaining the advantages of diffusion-style correction. On HumanEval-X C++, CDC with Dream-Coder 7B achieves 65.2% $\mathrm{pass@1}$ and 83.5% $\mathrm{pass@10}$, outperforming both autoregressive baselines, including DeepSeek-Coder with re-prompting at 64.0% $\mathrm{pass@1}$ and 82.9% $\mathrm{pass@10}$. On MBPP-C++, CDC reaches 59.2% $\mathrm{pass@1}$ and 74.8% $\mathrm{pass@10}$, approaching DeepSeek-Coder with re-prompting at $\mathrm{pass@1}$ (60.7%), slightly exceeding it at $\mathrm{pass@10}$ (74.1%), and substantially outperforming CodeLlama. These results suggest that denoising-time constraint integration can make diffusion code models competitive with strong autoregressive code generators, without relying on full-program re-generation after feedback. Additional regression analysis and ablations are provided in Appendix [B.3](https://arxiv.org/html/2605.16829#A2.SS3).
### 6.2 Code Generation Security
We next evaluate whether CDC can improve security-oriented code generation, where outputs must satisfy both security and functionality constraints. This setting is challenging because security fixes often require localized semantic changes while preserving the surrounding program behavior.
Baselines. We use Dream-Coder-7B as the diffusion backbone for CDC and DeepSeek-Coder-Instruct-6.7B as the autoregressive baseline. For autoregressive models, we compare vanilla generation with AR re-prompting [[8](https://arxiv.org/html/2605.16829#bib.bib10)], where an insecure or failing solution is returned to the model with feedback from a static program analyzer and the model regenerates a complete program. For diffusion models, we compare against vanilla diffusion and security prompting.

Benchmarks. We evaluate CDC on CWEval and LLMSecEval+ [[22](https://arxiv.org/html/2605.16829#bib.bib31),[28](https://arxiv.org/html/2605.16829#bib.bib32)].

Metrics. We report func@1 and sec@1, the fractions of single samples that pass the functional and security oracles, respectively, and joint func-sec@k (fs@k) for $k \in \{1, 5, 10\}$, where a task is solved if any of $k$ sampled outputs is both functional and secure.
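The difference between the single-sample metrics and the joint metric can be made concrete with a short Python sketch. The data layout (one list of `(functional, secure)` pairs per task) and the function names are assumptions for illustration, not the benchmarks' actual evaluation code.

```python
from typing import Sequence, Tuple

Outcome = Tuple[bool, bool]  # (functional, secure) for one sampled program


def single_sample_rates(tasks: Sequence[Sequence[Outcome]]) -> Tuple[float, float]:
    """func@1 and sec@1 in percent, using the first sample of each task."""
    n = len(tasks)
    func = sum(task[0][0] for task in tasks)
    sec = sum(task[0][1] for task in tasks)
    return 100.0 * func / n, 100.0 * sec / n


def fs_at_k(tasks: Sequence[Sequence[Outcome]], k: int) -> float:
    """Joint func-sec@k: a task counts only if one sample is functional AND secure."""
    solved = sum(any(f and s for f, s in task[:k]) for task in tasks)
    return 100.0 * solved / len(tasks)
```

Because the joint metric requires a single sample to pass both oracles, fs@1 can be well below both func@1 and sec@1, which matches the pattern of the vanilla rows in Table 2.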
Table [2](https://arxiv.org/html/2605.16829#S6.T2) shows that CDC substantially improves security-aware generation over vanilla diffusion and secure prompting. On CWEval, CDC increases $\mathrm{fs@1}$ from $12.04\%$ to $34.26\%$, nearly tripling the joint functionality-security success rate, and lifts both component metrics in tandem ($\mathrm{func@1}$: $26.85\% \to 39.81\%$; $\mathrm{sec@1}$: $18.52\% \to 41.67\%$). On LLMSecEval+, CDC improves security most strongly, increasing $\mathrm{sec@1}$ from $54.67\%$ for vanilla diffusion and $58.67\%$ with secure prompting to $80.67\%$, higher than every other method in the table, including AR re-prompting ($64.67\%$), while also improving $\mathrm{fs@1}$ from $14.67\%$ to $24.67\%$. The improvement extends to multi-sample regimes: CDC raises CWEval $\mathrm{fs@5}$/$\mathrm{fs@10}$ from $25.93\%$/$26.85\%$ to $49.07\%$/$51.85\%$, and matches AR re-prompting on LLMSecEval+ $\mathrm{fs@5}$ ($40.67\%$ vs. $40.00\%$).
Compared with AR re-prompting, CDC is competitive despite using a weaker diffusion backbone for functionality. AR re-prompting obtains the highest overall fs@1 on both benchmarks, but CDC closes much of the gap while performing interventions during generation rather than regenerating complete programs after failure. The per-language CWEval results (Table [4](https://arxiv.org/html/2605.16829#A2.T4), Appendix [B.3](https://arxiv.org/html/2605.16829#A2.SS3)) show that CDC is especially strong on Python, where it reaches 72.0% fs@1, and remains competitive on JavaScript, C++, and Go, suggesting that localized mid-diffusion repair can improve security without relying only on post-hoc complete-program correction. Additional efficiency analysis, regression analysis, and ablations are provided in Appendix [B.3](https://arxiv.org/html/2605.16829#A2.SS3).
### 6.3 Localization and Efficiency
Figure 3: Edited tokens per correction attempt (fewer means higher efficiency): (a) functionality corrections on HumanEval-X and MBPP, and (b) security corrections on CWEval and LLMSecEval+.

In addition to task success, we evaluate whether CDC performs targeted constraint correction rather than broad regeneration. We measure localization and efficiency across the same functional-correctness and security benchmarks described above. For each intervention, we compare the constrained output to the corresponding unconstrained/base output and report edit locality statistics, including the fraction of changed tokens, number of changed spans or edit clusters, and tokens inserted or remasked. We also measure corrective cost, including the number of additional denoising interventions, remasked tokens, and generated tokens. These metrics are used to assess whether CDC improves constraint satisfaction while concentrating changes on constraint-relevant regions and avoiding repeated full-program regeneration.
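One way to compute edit locality statistics of this kind is to align the base and constrained token sequences and count the non-matching spans. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; the specific definitions (an edited-token count taken as the larger side of each changed span, and a fraction normalized by the longer sequence) are assumptions for illustration and may differ from the statistics reported in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence


def edit_locality(base: Sequence[str], edited: Sequence[str]) -> Dict[str, float]:
    """Summarize how much of `base` was changed to obtain `edited`."""
    matcher = SequenceMatcher(None, base, edited, autojunk=False)
    spans = 0          # number of contiguous changed regions (edit clusters)
    edited_tokens = 0  # tokens touched, counting the larger side of each span
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        spans += 1
        edited_tokens += max(i2 - i1, j2 - j1)
    longest = max(len(base), len(edited), 1)
    return {
        "edit_spans": spans,
        "edited_tokens": edited_tokens,
        "changed_fraction": edited_tokens / longest,
    }
```

A full-program regeneration would typically show up as one or a few large spans covering most of the sequence, while a targeted correction yields a small `changed_fraction`.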
Figure [3](https://arxiv.org/html/2605.16829#S6.F3) reports the number of edited tokens for security and functionality corrections, where fewer edited tokens indicate a more targeted intervention. For security, CDC is substantially more local than AR re-prompting on both benchmarks: AR re-prompting rewrites a median of 113 tokens on CWEval and 99 tokens on LLMSecEval+, while CDC edits only 12 and 32 tokens, respectively. This supports the main motivation of CDC: because diffusion states remain editable during generation, analyzer feedback can be applied directly to the vulnerable region without regenerating the full program.
For functionality, CDC also produces more localized corrections than AR re\-prompting\. On HumanEval\-X C\+\+, CDC edits a median of 32 tokens compared with 123 tokens for AR re\-prompting\. On MBPP\-C\+\+, CDC edits a median of 35 tokens compared with 98 tokens for AR re\-prompting\. Thus, across both security and functionality settings, CDC reduces the amount of code rewritten during correction, suggesting that its gains come from focused denoising\-time interventions rather than expensive full\-program repair\. Additional efficiency analysis and ablation are included in Appendix[B\.3](https://arxiv.org/html/2605.16829#A2.SS3)\.
## 7 Conclusion
This paper introduced CDC, a training\-free neurosymbolic framework that integrates program\-level constraints into the reverse denoising process of discrete diffusion code models\. CDC uses the global, editable states exposed during sampling to evaluate intermediate programs, localize constraint feedback, and apply targeted corrections that reduce violation while staying close to the base denoiser\. The framework supports both soft and hard feedback: GradGuide uses soft surrogate signals for functional correctness, while MDFI uses hard symbolic program\-analysis signals for security repair through remasking, insertion, and feedback injection\. Across functional and security\-oriented benchmarks, CDC improves test pass rates, compile success, and joint functionality\-security success over standard diffusion decoding and syntax\-only constrained diffusion\. These results show that discrete diffusion models are a natural substrate for constrained code generation, enabling focused correction during generation instead of post\-hoc filtering or whole\-program regeneration\.
## Acknowledgments
This research is partially supported by NSF awards 2533631, 2401285, 2334936, and 2226816, and by DARPA under Contract No. #HR0011252E005. The authors acknowledge Research Computing at the University of Virginia. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DARPA. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 2234693. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
## References
- [1] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021). Structured denoising diffusion models in discrete state-spaces. arXiv:2107.03006. [Link](https://arxiv.org/abs/2107.03006)
- [2] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton (2021). Program synthesis with large language models. arXiv:2108.07732. [Link](https://arxiv.org/abs/2108.07732)
- [3] E. Basic and A. Giaretta (2024). From vulnerabilities to remediation: a systematic literature review of LLMs in code security. arXiv:2412.15004. [Link](https://arxiv.org/abs/2412.15004)
- [4] M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, D. Molnar, S. Whitman, and J. Saxe (2024). CyberSecEval 2: a wide-ranging cybersecurity evaluation suite for large language models. arXiv:2404.13161. [Link](https://arxiv.org/abs/2404.13161)
- [5] M. Brunsfeld (2018). Tree-sitter: an incremental parsing system. [Link](https://tree-sitter.github.io/tree-sitter/)
- [6] M. Cardei, J. K. Christopher, T. Hartvigsen, B. Kailkhura, and F. Fioretto (2025). Constrained discrete diffusion. arXiv:2503.09790. [Link](https://arxiv.org/abs/2503.09790)
- [7] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. arXiv:2107.03374. [Link](https://arxiv.org/abs/2107.03374)
- [8] X. Chen, M. Lin, N. Schärli, and D. Zhou (2023). Teaching large language models to self-debug. arXiv:2304.05128. [Link](https://arxiv.org/abs/2304.05128)
- [9] J. K. Christopher, M. Cardei, J. Liang, and F. Fioretto (2025). Neuro-symbolic generative diffusion models for physically grounded, robust, and safe generation. arXiv:2506.01121. [Link](https://arxiv.org/abs/2506.01121)
- [10] S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang (2025). DiffuCoder: understanding and improving masked diffusion models for code generation. arXiv:2506.20639. [Link](https://arxiv.org/abs/2506.20639)
- [11] N. Gruver, S. Stanton, N. C. Frey, T. G. J. Rudner, I. Hotzel, J. Lafrance-Vanasse, A. Rajpal, K. Cho, and A. G. Wilson (2023). Protein design with guided discrete diffusion. arXiv:2305.20009. [Link](https://arxiv.org/abs/2305.20009)
- [12] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang (2024). DeepSeek-Coder: when the large language model meets programming – the rise of code intelligence. arXiv:2401.14196. [Link](https://arxiv.org/abs/2401.14196)
- [13] J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. arXiv:2006.11239. [Link](https://arxiv.org/abs/2006.11239)
- [14] J. E. Hu, H. Khayrallah, R. Culkin, P. Xia, T. Chen, M. Post, and B. Van Durme (2019). Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 839–850. [Link](https://aclanthology.org/N19-1090/), [DOI](https://dx.doi.org/10.18653/v1/N19-1090)
- [15] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, and J. Lin (2024). Qwen2.5-Coder technical report. arXiv:2409.12186. [Link](https://arxiv.org/abs/2409.12186)
- [16] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. Sutherland Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals (2022). Competition-level code generation with AlphaCode. Science 378(6624), pp. 1092–1097. [DOI](https://dx.doi.org/10.1126/science.abq1158), [Link](https://doi.org/10.1126/science.abq1158)
- [17] N. Mündler, J. Dekoninck, and M. Vechev (2025). Constrained decoding of diffusion LLMs with context-free grammars. arXiv:2508.10111. [Link](https://arxiv.org/abs/2508.10111)
- [18] N. Mündler, J. He, H. Wang, K. Sen, D. Song, and M. Vechev (2025). Type-constrained code generation with language models. Proceedings of the ACM on Programming Languages 9(PLDI), pp. 601–626. ISSN 2475-1421. [Link](http://dx.doi.org/10.1145/3729274), [DOI](https://dx.doi.org/10.1145/3729274)
- [19] A. Ni, S. Iyer, D. Radev, V. Stoyanov, W. Yih, S. I. Wang, and X. V. Lin (2023). LEVER: learning to verify language-to-code generation with execution. arXiv:2302.08468. [Link](https://arxiv.org/abs/2302.08468)
- [20] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025). Large language diffusion models. arXiv:2502.09992. [Link](https://arxiv.org/abs/2502.09992)
- [21] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri (2021). Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. arXiv:2108.09293. [Link](https://arxiv.org/abs/2108.09293)
- [22] J. Peng, L. Cui, K. Huang, J. Yang, and B. Ray (2025). CWEval: outcome-driven evaluation on functionality and security of LLM code generation. arXiv:2501.08200. [DOI](https://dx.doi.org/10.48550/arXiv.2501.08200), [Link](https://arxiv.org/abs/2501.08200)
- [23] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022). High-resolution image synthesis with latent diffusion models. arXiv:2112.10752. [Link](https://arxiv.org/abs/2112.10752)
- [24] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve (2024). Code Llama: open foundation models for code. arXiv:2308.12950. [Link](https://arxiv.org/abs/2308.12950)
- [25] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024). Simple and effective masked diffusion language models. arXiv:2406.07524. [Link](https://arxiv.org/abs/2406.07524)
- [26] Y. Schiff, S. S. Sahoo, H. Phung, G. Wang, S. Boshar, H. Dalla-torre, B. P. de Almeida, A. Rush, T. Pierrot, and V. Kuleshov (2025). Simple guidance mechanisms for discrete diffusion models. arXiv:2412.10193. [Link](https://arxiv.org/abs/2412.10193)
- [27] C. Tony, N. E. D. Ferreyra, M. Mutas, S. Dhiff, and R. Scandariato (2025). Prompting techniques for secure code generation: a systematic investigation. arXiv:2407.07064. [Link](https://arxiv.org/abs/2407.07064)
- [28] C. Tony, M. Mutas, N. E. Díaz Ferreyra, and R. Scandariato (2023). LLMSecEval: a dataset of natural language prompts for security evaluations. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pp. 588–592. [DOI](https://dx.doi.org/10.1109/MSR59073.2023.00084), [Link](https://doi.org/10.1109/MSR59073.2023.00084)
- [29] Z. Xie, J. Ye, L. Zheng, J. Gao, J. Dong, Z. Wu, X. Zhao, S. Gong, X. Jiang, Z. Li, and L. Kong (2025). Dream-Coder 7B: an open diffusion language model for code. arXiv:2509.01142. [Link](https://arxiv.org/abs/2509.01142)
- [30] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck (2014). Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy, pp. 590–604. [DOI](https://dx.doi.org/10.1109/SP.2014.44), [Link](https://doi.org/10.1109/SP.2014.44)
- [31] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024). SWE-agent: agent-computer interfaces enable automated software engineering. arXiv:2405.15793. [Link](https://arxiv.org/abs/2405.15793)
- [32] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025). Dream 7B: diffusion large language models. arXiv:2508.15487. [Link](https://arxiv.org/abs/2508.15487)
- [33] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen (2023). RepoCoder: repository-level code completion through iterative retrieval and generation. arXiv:2303.12570. [Link](https://arxiv.org/abs/2303.12570)
- [34] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang (2023). CodeGeeX: a pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, New York, NY, USA, pp. 5673–5684. [DOI](https://dx.doi.org/10.1145/3580305.3599790), [Link](https://doi.org/10.1145/3580305.3599790)
## Broader Impact
This work aims to improve the reliability and security of code generation by steering diffusion models toward programs that better satisfy functional and security constraints\. Potential benefits include fewer syntax errors, higher task correctness, and reduced vulnerabilities in generated code\. However, stronger code generation is dual\-use and could be misused if constraint mechanisms are adapted toward harmful objectives\.
## Limitations
CDC depends on the quality and coverage of the feedback mechanisms used during denoising\. Specifically, GradGuide relies on a learned surrogate, which may provide noisy or misleading estimates of functional correctness, while MDFI relies on static analysis, which may miss vulnerabilities or produce false positives\. The empirical evaluation is also limited to a fixed set of benchmarks, languages, and diffusion backbones, so performance may differ for larger codebases, longer\-context tasks, or more complex software\-engineering workflows\. Finally, although CDC performs more localized edits than post\-hoc regeneration, it still adds inference\-time overhead from surrogate evaluation, static analysis, and constraint\-aware correction\.
## Appendix A Related work (Extended)
With the success of diffusion models in continuous domains, especially in image and video generation\[[13](https://arxiv.org/html/2605.16829#bib.bib19),[23](https://arxiv.org/html/2605.16829#bib.bib11)\], recent work has extended diffusion principles to discrete data\. Early work by Austin et al\.\[[1](https://arxiv.org/html/2605.16829#bib.bib3)\]introduced discrete denoising diffusion models, which generalize multinomial diffusion with structured transition operators for discrete state spaces\. Sahoo et al\.\[[25](https://arxiv.org/html/2605.16829#bib.bib2)\]subsequently showed that masked discrete diffusion with an absorbing\-state corruption process can be effective for language modeling, deriving a simplified Rao\-Blackwellized training objective and establishing a strong masked diffusion baseline that approaches autoregressive perplexity\. More recently, these models have been scaled to large language model regimes\. Ye et al\.\[[32](https://arxiv.org/html/2605.16829#bib.bib25)\]introduced Dream 7B, a scalable diffusion large language model that performs competitively with similarly sized autoregressive models while emphasizing parallel refinement, planning, and flexible inference\. In the coding domain, Gong et al\.\[[10](https://arxiv.org/html/2605.16829#bib.bib26)\]developed DiffuCoder, a 7B diffusion language model specialized for code generation, and showed that diffusion\-specific decoding analysis and reinforcement\-learning\-based post\-training can further improve coding performance\. More recently, Xie et al\.\[[29](https://arxiv.org/html/2605.16829#bib.bib24)\]introduced Dream\-Coder 7B, an open discrete diffusion language model for code generation with emergent any\-order generation capabilities, demonstrating that diffusion language models can adapt their decoding behavior across coding tasks while achieving competitive performance on standard code benchmarks\.
Controllable generation is a central challenge in generative AI, particularly for code generation, where outputs are expected not only to be plausible, but also to satisfy validity, correctness, safety, and other program\-level properties\. In autoregressive language models, this has been studied through constrained decoding, reranking, rejection sampling, and iterative self\-correction\[[14](https://arxiv.org/html/2605.16829#bib.bib12),[19](https://arxiv.org/html/2605.16829#bib.bib13)\]\. These methods typically enforce constraints by restricting next\-token decisions, selecting among multiple completed candidates, or repairing an initial output using verifier or execution feedback\. Specific to code generation, this line includes self\-debugging and test\-driven refinement\[[8](https://arxiv.org/html/2605.16829#bib.bib10)\], type\-constrained decoding\[[18](https://arxiv.org/html/2605.16829#bib.bib15)\], and security\-aware code evaluation and repair\[[4](https://arxiv.org/html/2605.16829#bib.bib14)\]\. More broadly, these approaches inherit the structure of autoregressive decoding: early decisions propagate forward, revising a constraint\-relevant region often requires regenerating substantial suffixes, and program\-level constraints are therefore typically handled through token\-local controls or post\-hoc repair\.
In contrast, discrete diffusion models are particularly well suited to constraint integration: they refine the entire sequence over multiple denoising steps, which creates natural opportunities for sequence-level constraint enforcement during sampling. This has led to several works on controllable generation. Schiff et al. \[[26](https://arxiv.org/html/2605.16829#bib.bib16)\] derive classifier-based and classifier-free guidance for discrete diffusion models, adapting standard diffusion guidance mechanisms to discrete state spaces. For protein design, Gruver et al. \[[11](https://arxiv.org/html/2605.16829#bib.bib17)\] introduce diffusioN Optimized Sampling (NOS), which performs guidance through gradients in the denoiser hidden states and supports constrained sequence optimization. Other techniques \[[6](https://arxiv.org/html/2605.16829#bib.bib4),[9](https://arxiv.org/html/2605.16829#bib.bib18)\] formulate diffusion sampling as constrained optimization and apply projection-based updates during reverse denoising.
Specific to code generation, Mundler et al. \[[17](https://arxiv.org/html/2605.16829#bib.bib28)\] present a constrained decoding method for diffusion LLMs under context-free grammars, showing that syntax-level constraints for programming languages can be enforced during out-of-order denoising with improved syntactic correctness. The work presented in this paper is most closely related to this line of research, but targets a harder code-generation setting in which constraints are semantic and security-related. In this setting, prior guidance-based and syntax-level constrained decoding methods are not sufficient, as program semantic correctness and security generally cannot be captured by local token preferences or grammar constraints alone.
Table [3](https://arxiv.org/html/2605.16829#A1.T3) summarizes the closest lines of work and the dimensions that most directly separate CDC from prior approaches.

| Approach | Base | When | Constraint Scope | Semantic / Verifier |
| --- | --- | --- | --- | --- |
| AR constrained decoding, reranking, self-correction \[[14](https://arxiv.org/html/2605.16829#bib.bib12),[19](https://arxiv.org/html/2605.16829#bib.bib13),[8](https://arxiv.org/html/2605.16829#bib.bib10),[4](https://arxiv.org/html/2605.16829#bib.bib14)\] | AR | token / post | tests, execution, repair feedback | post-hoc only |
| Type-constrained decoding \[[18](https://arxiv.org/html/2605.16829#bib.bib15)\] | AR | token | syntax, types | No |
| Discrete diffusion guidance \[[26](https://arxiv.org/html/2605.16829#bib.bib16),[11](https://arxiv.org/html/2605.16829#bib.bib17)\] | Diff. | denoise | soft or gradient-based objectives | Limited |
| Optimization-based constrained diffusion \[[6](https://arxiv.org/html/2605.16829#bib.bib4),[9](https://arxiv.org/html/2605.16829#bib.bib18)\] | Diff. | denoise | projection / optimization | Limited |
| CFG-constrained diffusion for code \[[17](https://arxiv.org/html/2605.16829#bib.bib28)\] | Diff. | denoise | syntax (CFG) | No |
| CDC (ours) | Diff. | denoise | syntax, semantics, verifier feedback | Yes |

Table 3: Comparison of CDC with related approaches.
## Appendix B Extended Experimental Details and Results
### B.1 Benchmarks
##### Syntax and functionality\.
C\+\+ program synthesis is evaluated with HumanEval\-X C\+\+ and MBPP\-C\+\+\[[7](https://arxiv.org/html/2605.16829#bib.bib20),[2](https://arxiv.org/html/2605.16829#bib.bib22),[34](https://arxiv.org/html/2605.16829#bib.bib21)\]\. HumanEval\-X extends HumanEval to multiple programming languages with hand\-written reference solutions and execution tests; we use its 164 C\+\+ tasks\. MBPP\-C\+\+ contains 397 C\+\+ translations of MBPP\-style programming problems\. For each problem, the model receives the natural\-language description and function signature, and must generate a complete solution\.
We report compile success ($\mathrm{Syntax}$), functional $\mathrm{pass@1}$, and $\mathrm{pass@10}$. A sample is functionally correct if it compiles and passes all unit tests. For $\mathrm{pass@1}$, every model is run deterministically at temperature $0$ with seed $0$, $256$ denoising / decoding steps, and a maximum of $512$ generated tokens. For $\mathrm{pass@10}$ we draw $10$ independent samples (seeds $0 \dots 9$, otherwise the same step and length budget) and report the unbiased estimator $\mathrm{pass@k} = \mathbb{E}\bigl[1 - \binom{n-c}{k} / \binom{n}{k}\bigr]$, with $n = 10$ and $c$ the number of correct samples per task.
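The estimator above can be sketched as follows. This is a minimal illustration, assuming the per-task counts of correct samples are already available; the helper names `pass_at_k` and `mean_pass_at_k` are hypothetical and not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-task unbiased estimate 1 - C(n-c, k) / C(n, k).

    n: samples drawn, c: samples that compile and pass all tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-task estimates over a benchmark (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With $n = 10$, a task with a single correct sample already contributes $1$ to $\mathrm{pass@10}$, while it contributes only $0.1$ to $\mathrm{pass@1}$ under this estimator.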
##### Security\.
Security-oriented code generation is evaluated on CWEval and LLMSecEval\[[22](https://arxiv.org/html/2605.16829#bib.bib31),[28](https://arxiv.org/html/2605.16829#bib.bib32)\]. CWEval evaluates functionality and security on the same programming tasks, which makes it suitable for testing how a method's security objectives affect correctness. We use the 108-prompt CWEval suite spanning C, C++, Go, JavaScript, and Python. LLMSecEval contains natural-language prompts associated with common CWE patterns; LLMSecEval+ is constructed by adding executable functional and security oracles and filtering prompts for which these oracles can be run reliably, resulting in prompts.
For security, we report three pass@1 metrics. Let $F_i$ indicate that sample $i$ passes the functional oracle, and let $S_i$ indicate that it passes the security oracle. Then

$$\mathrm{func@1} = \mathbb{E}[F_i], \qquad \mathrm{sec@1} = \mathbb{E}[S_i], \qquad \mathrm{func\text{-}sec@1} = \mathbb{E}[F_i S_i].$$

The joint metric $\mathrm{func\text{-}sec@1}$ is our primary security metric, since generated code must be both secure and functionally correct.
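As a small sketch of how the three metrics relate, assuming each prompt yields one sample with a boolean functional verdict and a boolean security verdict (the function name `security_metrics` and the input layout are illustrative assumptions):

```python
def security_metrics(outcomes):
    """Compute func@1, sec@1, and func-sec@1 from per-prompt verdicts.

    outcomes: list of (functional_pass, security_pass) booleans, one pair
    per prompt. The joint metric counts a prompt only if both hold.
    """
    n = len(outcomes)
    func = sum(1 for f, _ in outcomes if f) / n
    sec = sum(1 for _, s in outcomes if s) / n
    func_sec = sum(1 for f, s in outcomes if f and s) / n
    return {"func@1": func, "sec@1": sec, "func-sec@1": func_sec}
```

Because $F_i S_i \le \min(F_i, S_i)$, the joint metric can never exceed either marginal metric, which is why it is the stricter of the three.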
### B.2 Implementation Details
GradGuide. GradGuide uses a Qwen2.5-Coder-1.5B-Instruct surrogate \[[15](https://arxiv.org/html/2605.16829#bib.bib34)\] trained on CodeContest-derived programs \[[16](https://arxiv.org/html/2605.16829#bib.bib23)\] with execution labels. The surrogate predicts a continuous correctness score from soft token embeddings and is used for both localization and ALM correction. We apply GradGuide at each denoising step with adaptive editing: saliency selects token spans, spans are expanded to local syntactic neighborhoods, and high-risk spans are remasked before the next reverse transition.
MDFI. MDFI invokes the static-analysis evaluator at fixed denoising checkpoints. The analyzer constructs a tolerant partial Code Property Graph \[[30](https://arxiv.org/html/2605.16829#bib.bib29)\] using Tree-sitter-style incremental parsing \[[5](https://arxiv.org/html/2605.16829#bib.bib30)\], extracts dataflow witnesses with bounded source-to-sink search, obtains structural risk with the product FSM when exact labels are masked, selects semantic neighborhoods from the AST under a token budget, and injects compact remediation feedback into a preallocated prompt buffer.
Efficiency measurements. We measure both outcome quality and corrective cost. For GradGuide, we report total tokens generated by the model, tokens per instance, tokens per passing solution, tokens rewritten per failed instance, and fraction of the program body edited. For MDFI, we additionally measure edit span locality and number of edit clusters. AR re-prompting is counted as regenerating the full program body, while CDC counts only the regions that are remasked or inserted during the denoising trajectory.
### B.3 Additional Experimental Results
Figure 4: Efficiency and locality of CDC vs. AR+Reprompt.

Table 4: Per-language fs@1 on CWEval (108 prompts total). Parenthesized values give the absolute change relative to the corresponding model’s Vanilla row: yellow marks an improvement of $<10$ points, green marks an improvement of $\geq 10$ points, and red marks a regression. Within each model, the strongest value per language is bolded.

| Model | Method | py (25) | js (23) | c (20) | cpp (21) | go (19) | ALL (108) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion | Vanilla | 24.0 | 13.0 | 0.0 | 14.3 | 5.3 | 12.0 |
| Diffusion | + sec prompt | 28.0 (+4) | 17.4 (+4) | 0.0 (+0) | 14.3 (+0) | 5.3 (+0) | 13.9 (+2) |
| Diffusion | MDFI (Ours) | **72.0** (+48) | **34.8** (+22) | **5.0** (+5) | **28.6** (+14) | **21.1** (+16) | **34.3** (+22) |
| AR | Vanilla | 44.0 | 21.7 | 10.0 | 14.3 | 10.5 | 21.3 |
| AR | Reprompt | **64.0** (+20) | **43.5** (+22) | **25.0** (+15) | **33.3** (+19) | **26.3** (+16) | **39.8** (+19) |

In this section we present additional detailed results\.
Figure 5: CWEval efficiency comparison between AR re-prompting and MDFI. MDFI lowers pipeline token cost, edited-token count, edit span, and edit clusters by repairing localized vulnerable regions rather than regenerating the entire program.

Figure 6: Per-language efficiency means on CWEval. MDFI substantially reduces edited tokens, edit span, and edit clusters across C, C++, Go, JavaScript, and Python.

#### B.3.1 CDC is substantially more local and token-efficient than AR re-prompting
Figure [4](https://arxiv.org/html/2605.16829#A2.F4) compares GradGuide to AR plus one re-prompt on HumanEval-X C++ and MBPP-C++. On HumanEval-X, AR plus one re-prompt generates 35,640 total model tokens, whereas GradGuide generates 26,723 tokens, a $0.75\times$ cost ratio and a savings of 8,917 tokens. On MBPP-C++, AR plus one re-prompt generates 89,456 tokens, whereas GradGuide generates 38,676 tokens, a $0.43\times$ cost ratio and a savings of 50,780 tokens.
Normalizing by problem count and by successful solution gives the same conclusion\. On HumanEval\-X, GradGuide uses 163 tokens per instance and 265 tokens per passing solution, compared with 217 and 356 for AR re\-prompting\. On MBPP\-C\+\+, GradGuide uses 97 tokens per instance and 165 tokens per passing solution, compared with 225 and 376 for AR re\-prompting\. Thus the efficiency gain is not merely due to differences in benchmark size; it persists when normalized by both instances and successful outputs\.
The locality analysis explains the efficiency gain. For failed instances that require correction, AR re-prompting rewrites a median of 123 tokens on HumanEval-X and 98 tokens on MBPP-C++, while GradGuide edits medians of 32 and 35 tokens, respectively. In other words, GradGuide is about $3\times$ more surgical on HumanEval-X and $2\times$ more surgical on MBPP-C++. At the body-fraction level, AR re-prompting rewrites essentially the entire program, whereas GradGuide concentrates edits on localized regions identified by surrogate saliency. This supports the focused-editing hypothesis: CDC achieves its gains by revising the parts of the denoising trajectory most relevant to the constraint, not by repeatedly sampling complete programs.
Figure [5](https://arxiv.org/html/2605.16829#A2.F5) compares MDFI with AR re-prompting on CWEval. AR re-prompting produces roughly a full second program, with a mean pipeline cost around $3.1 \times 10^{2}$ tokens. MDFI reduces the mean pipeline cost to roughly $2.1 \times 10^{2}$ tokens. The difference is larger for edited tokens: AR re-prompting edits on the order of an entire solution, while MDFI edits a small localized span. Median edit span drops from 0.89 of the program body under AR re-prompting to below roughly 0.1 under MDFI, and median edit clusters drop from 4 to 1.
The per\-language analysis in Figure[6](https://arxiv.org/html/2605.16829#A2.F6)confirms that locality is robust across languages\. MDFI reduces edited tokens by 94% for C, 93% for C\+\+, 93% for Go, 63% for JavaScript, and 65% for Python\. Edit\-span reductions are similarly large: 94% for C, 93% for C\+\+, 91% for Go, 72% for JavaScript, and 61% for Python\. Edit\-cluster reductions are strongest in C/C\+\+/Go, where vulnerable patterns tend to be localized around library calls or pointer/string operations; they are smaller in Python and JavaScript, where secure fixes often require changing a higher\-level API pattern across multiple nearby tokens\. Even in those languages, MDFI remains substantially more local than whole\-program re\-prompting\.
#### B\.3\.2Ablations
##### GradGuide components\.
Figure [7](https://arxiv.org/html/2605.16829#A2.F7)(a) reports per-configuration pass@1 on HumanEval-X C++ (Dream-Coder 7B, greedy): vanilla $34.1$, ALM-only $40.2$, adaptive editing-only $50.6$, full operator $65.2$. Random editing matches vanilla, so localization must be constraint-steered. Panel (b) decomposes the full-operator gain: ALM-only ($+6.1$) and adaptive editing-only ($+16.5$) predict $56.7$ under independent action, whereas the full operator reaches $65.2$, a $+8.5$ pp super-additive synergy that arises from ALM shaping the local distribution *inside* the region adaptive editing reopens.
Figure 7: Component and localization-choice ablation of CDC on HumanEval-X C++ ($164$ tasks, Dream-Coder 7B, greedy). (a) Per-configuration pass@1. (b) Component composition: ALM-only $+6.1$ pp and adaptive editing-only $+16.5$ pp predict $56.7\%$ under independent action; the full operator reaches $65.2\%$, a $+8.5$ pp super-additive synergy.
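The composition arithmetic in panel (b) can be reproduced with a few lines. The sketch below assumes that "independent action" means the two single-component gains add on top of the vanilla rate; the function name `synergy` is illustrative.

```python
def synergy(vanilla, only_a, only_b, full):
    """Return (gain_a, gain_b, additive_prediction, synergy) in points.

    Assumes the additive baseline vanilla + gain_a + gain_b; the synergy is
    whatever the full operator achieves beyond that baseline.
    """
    gain_a = only_a - vanilla
    gain_b = only_b - vanilla
    additive = vanilla + gain_a + gain_b
    return gain_a, gain_b, additive, full - additive


# pass@1 values reported for HumanEval-X C++ in this section.
gain_alm, gain_edit, predicted, extra = synergy(34.1, 40.2, 50.6, 65.2)
```

A positive last component indicates that the two components together do better than the sum of their individual gains.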
##### MDFI insertion budget and neighborhood scope\.
Figure [8](https://arxiv.org/html/2605.16829#A2.F8)(a) shows that func-sec@1 plateaus for a mask-insertion amount $K \in [8, 12]$ on both benchmarks (CWEval $34.3$, LLMSecEval+ $24.7$), while $K > 16$ regresses functionality. Panel (b) shows that the deployed Parent+Leaf neighborhood ($34.3$) beats both the tighter Token-Window ($24.1$) and the looser Use–Def Slice ($26.9$) alternatives.
Figure 8: MDFI scope ablation. (a) Insertion amount $K$ on CWEval and LLMSecEval+; func-sec@1 plateaus at $K \in [8, 12]$. (b) Remasking neighborhood scope on CWEval; the deployed Parent+Leaf rule peaks at $34.3\%$, beating both Token-Window and broader (Use–Def Slice) alternatives.
#### B.3.3 CDC reduces correction-induced regressions
To complement the headline pass@$k$ rates, we characterize each method’s behavior as a *corrector*: how it converts vanilla outputs into corrected outputs at the per-prompt level. Following standard practice for evaluating verifier- or test-feedback-driven correctors \[[8](https://arxiv.org/html/2605.16829#bib.bib10)\], we report four complementary measurements that probe different facets of corrective behavior, summarized in Table [5](https://arxiv.org/html/2605.16829#A2.T5): improvement rate (fraction of base-failing prompts recovered), the raw improvement-to-regression count, net change, and constructive ratio (the share of all verdict changes that are improvements). Together they characterize whether a method recovers failures at scale, whether it does so with high precision, and whether its action is closer to a monotone corrector or to coin-flip churn.
Table 5: Corrective effectiveness on functional benchmarks. For each method we report four complementary per-prompt measurements over the subset of prompts: improvement rate (↑), the fraction of base-failing prompts the method recovers; Imp. / Reg. (↑), the raw counts of improvements and regressions induced; net change (↑), improvements minus regressions; and constructive ratio (↑), the share of all verdict changes that are improvements ($100\%$ corresponds to a strictly monotone corrector). For CFG-CD and CDC, the reference vanilla is Dream-Coder 7B; for AR+Reprompt, the reference is DeepSeek-Coder-Instruct-6.7B. Best per row is bolded. CDC is the strongest corrector on every measurement on both benchmarks: it recovers $48.0\%/43.8\%$ of base-failing programs (vs. AR+Reprompt’s $28.8\%/4.4\%$), delivers $+46/+120$ net improvements (vs. $+2/+1$), and converts $94$–$97\%$ of its verdict changes into fixes (vs. AR+Reprompt’s near-coin-flip $\sim 53\%$). CFG-CD recovers *zero* failing programs on MBPP-C++.

| Angle | Benchmark | CFG-CD | AR+Reprompt | CDC (Ours) |
| --- | --- | --- | --- | --- |
| Improvement rate ↑ | HE-X C++ | 13.7% | 28.8% | **48.0%** |
| Improvement rate ↑ | MBPP-C++ | 0.0% | 4.4% | **43.8%** |
| Imp. / Reg. (counts) | HE-X C++ | 14 / 16 | 19 / 17 | **49 / 3** |
| Imp. / Reg. (counts) | MBPP-C++ | 0 / 8 | 7 / 6 | **124 / 4** |
| Net ↑ | HE-X C++ | -2 | +2 | **+46** |
| Net ↑ | MBPP-C++ | -8 | +1 | **+120** |
| Constructive ratio ↑ | HE-X C++ | 46.7% | 52.8% | **94.2%** |
| Constructive ratio ↑ | MBPP-C++ | 0.0% | 53.8% | **96.9%** |

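The four measurements can be derived from paired per-prompt verdicts. The following is a minimal sketch, assuming one boolean verdict per prompt before and after correction; `corrector_metrics` is a hypothetical helper, and returning `0.0` for empty denominators is a choice made here for illustration.

```python
def corrector_metrics(base, corrected):
    """Summarize a corrector from paired pass/fail verdicts.

    base, corrected: equal-length lists of booleans, one entry per prompt,
    giving the verdict of the vanilla output and of the corrected output.
    """
    improvements = sum(1 for b, c in zip(base, corrected) if not b and c)
    regressions = sum(1 for b, c in zip(base, corrected) if b and not c)
    base_failing = sum(1 for b in base if not b)
    changes = improvements + regressions
    return {
        "improvement_rate": improvements / base_failing if base_failing else 0.0,
        "improvements": improvements,
        "regressions": regressions,
        "net": improvements - regressions,
        "constructive_ratio": improvements / changes if changes else 0.0,
    }
```

Applied to the counts in Table 5, the net and constructive-ratio rows follow directly from the Imp. / Reg. row: for CDC on HumanEval-X C++, $49 - 3 = 46$ and $49 / (49 + 3) \approx 94.2\%$.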
## Appendix C Details of GradGuide
This appendix provides the implementation details of the surrogate-gradient operator used in Section [5.2](https://arxiv.org/html/2605.16829#S5.SS2). GradGuide instantiates the constraint-aware operator $\mathcal{P}_{\mathcal{C}}$ for functional correctness and syntactic validity. Its goal is to provide differentiable denoising-time feedback for constraints that are ultimately evaluated by non-differentiable program oracles, such as compilation and unit-test execution.
### C.1 Surrogate Model
Program-level correctness is non-differentiable: a candidate program either compiles and passes its tests, or it does not. GradGuide therefore trains an auxiliary surrogate $g_{\phi}$ ahead of time and uses it only at inference time. The diffusion model parameters $\theta$ are never updated.
At reverse step $t$, the surrogate maps the clean-state proposal $\hat{\bm{x}}_0^{(t)} \in \Delta^{L \times |\mathcal{V}|}$ and task context $c$ to one or more continuous correctness scores,

$$g_{\phi,j}(\hat{\bm{x}}_0^{(t)}, c) \in [0,1], \qquad j \in \{1,\dots,m\}. \tag{19}$$

To make the surrogate differentiable with respect to token distributions, each soft token distribution is converted into a soft embedding,

$$\mathrm{Emb}(\hat{\bm{x}}_0^{(t)})^{i} = \hat{x}_0^{(t),i}\,\mathbf{E}_{\mathrm{tok}}, \qquad i \in [L], \tag{20}$$

where $\hat{x}_0^{(t),i} \in \Delta^{|\mathcal{V}|}$ is the per-position distribution (the $i$-th row of $\hat{\bm{x}}_0^{(t)}$) and $\mathbf{E}_{\mathrm{tok}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is the surrogate input embedding matrix. The surrogate is trained using execution-derived labels, including compile success and test-pass outcomes. In our implementation, we fine-tune a Qwen2.5-Coder-1.5B-Instruct backbone \[[15](https://arxiv.org/html/2605.16829#bib.bib34)\] with a small regression head on the last-token hidden state, using CodeContest-derived $(\text{problem}, \text{code})$ pairs labeled by $y = \mathbb{1}[\text{compiles}] \cdot (\text{test-pass ratio}) \in [0,1]$ from real C++ execution; class-balanced binary cross-entropy is optimized over the head with a low-rank LoRA adapter on the backbone. The model is conditioned on a fixed judge prompt prepended to each input: “*You are a strict unit-test judge. Given a programming problem and a solution, assign a functionality correctness score from 0 to 1: 0 = completely wrong or non-functional, 1 = fully correct and passes all intended behavior.\\n\\nProblem:\\n*”, followed by the problem text and the candidate code (token IDs at training time, soft embeddings at inference), so that the same prompt is used for training the regression head, for localization, and for proposal correction.
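To make Eq. (20) and the label definition concrete, here is a small dependency-free sketch. It assumes the proposal is given as a list of per-position probability rows and the embedding matrix as a list of rows; `soft_embed` and `execution_label` are hypothetical names, and treating a task with zero tests as label `0.0` is an assumption made only for this example.

```python
def soft_embed(x_hat, e_tok):
    """Soft embeddings x_hat @ e_tok.

    x_hat: L rows, each a probability distribution over |V| tokens.
    e_tok: |V| rows, each a d-dimensional token embedding.
    Returns L rows of d-dimensional expected embeddings.
    """
    dim = len(e_tok[0])
    return [
        [sum(p * e_tok[v][k] for v, p in enumerate(row)) for k in range(dim)]
        for row in x_hat
    ]


def execution_label(compiles, tests_passed, tests_total):
    """Training label y = 1[compiles] * (test-pass ratio), in [0, 1]."""
    if not compiles or tests_total == 0:
        return 0.0
    return tests_passed / tests_total
```

A one-hot row recovers the ordinary embedding lookup, so the same surrogate can consume hard token IDs at training time and soft distributions at inference time.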
For each constraint $j$, we define the surrogate-relaxed violation

$$\Delta g_j(\hat{\bm{x}}_0^{(t)}, c) = \max\bigl(0, \tau_j - g_{\phi,j}(\hat{\bm{x}}_0^{(t)}, c)\bigr), \tag{21}$$

where $\tau_j \in (0,1]$ is the target satisfaction threshold. The aggregate surrogate violation is

$$\Delta G(\hat{\bm{x}}_0^{(t)}, c) = \sum_{j=1}^{m} \Delta g_j(\hat{\bm{x}}_0^{(t)}, c). \tag{22}$$

The surrogate functions $g_{\phi,j}$, $\mathrm{Emb}$, $\Delta g_j$, and $\Delta G$ are well-defined on any soft proposal in $\Delta^{L \times |\mathcal{V}|}$; we have written them at $\hat{\bm{x}}_0^{(t)}$ above for clarity, and Mode A below reuses them with its inner-loop iterate in place of $\hat{\bm{x}}_0^{(t)}$.
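Eqs. (21) and (22) are hinge-style penalties on the surrogate scores. A minimal sketch, assuming the scores $g_{\phi,j}$ have already been computed by some surrogate (the function name `surrogate_violations` is illustrative):

```python
def surrogate_violations(scores, thresholds):
    """Per-constraint violations max(0, tau_j - g_j) and their sum.

    scores: surrogate scores g_j in [0, 1], one per constraint.
    thresholds: target thresholds tau_j in (0, 1], one per constraint.
    """
    per_constraint = [max(0.0, tau - g) for g, tau in zip(scores, thresholds)]
    return per_constraint, sum(per_constraint)
```

A constraint whose score already meets its threshold contributes zero, so only unmet constraints generate gradient signal.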
### C.2 Gradient-Based Localization
GradGuide computes the gradient of the aggregate surrogate violation with respect to each soft token embedding, evaluated at the clean-state proposal $\hat{\bm{x}}_0^{(t)}$:

$$\bm{g}_i\bigl(\hat{\bm{x}}_0^{(t)}, c\bigr) = \nabla_{\mathrm{Emb}(\hat{\bm{x}}_0^{(t)})^{i}} \Delta G\bigl(\hat{\bm{x}}_0^{(t)}, c\bigr), \qquad i \in [L]. \tag{23}$$

The norm $\|\bm{g}_i(\hat{\bm{x}}_0^{(t)}, c)\|_2$ measures how sensitive the predicted violation is to changes at position $i$. GradGuide combines this violation sensitivity with uncertainty signals from the base denoiser:

$$a_i = \underbrace{\bigl\|\bm{g}_i\bigl(\hat{\bm{x}}_0^{(t)}, c\bigr)\bigr\|_2}_{\text{violation sensitivity}} + \alpha_H H\bigl(\hat{x}_0^{(t),i}\bigr) + \alpha_C \left(1 - \max_{v \in \mathcal{V}} \hat{x}_0^{(t),i}(v)\right), \tag{24}$$

where $H(\hat{x}_0^{(t),i})$ is the entropy of the base proposal at position $i$, and $\alpha_H, \alpha_C \geq 0$ weight entropy and confidence terms. Importantly, the saliency is computed before any correction has been applied, so all three terms are anchored at $\hat{\bm{x}}_0^{(t)}$. GradGuide then selects the top-$k$ saliency positions and expands them to syntactically coherent neighborhoods:

$$\mathcal{S}_t = \mathrm{Expand}\left(\mathrm{Top}\text{-}k(a_1, \dots, a_L)\right). \tag{25}$$

The expansion operation maps isolated token positions to coherent edit regions, such as the enclosing line, expression, block, or delimiter-balanced span. This prevents GradGuide from editing individual tokens in ways that break local syntax.
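The localization pipeline of Eqs. (24) and (25) can be sketched as below. The gradient norms are assumed to come from a surrogate backward pass that is not shown, and expansion to the enclosing line is only one of the expansion rules named in the text; all function names and the `line_of_token` layout are assumptions for illustration.

```python
import math


def saliency(grad_norms, probs, alpha_h=1.0, alpha_c=1.0):
    """a_i = ||g_i||_2 + alpha_h * entropy_i + alpha_c * (1 - max prob_i)."""
    scores = []
    for g, p in zip(grad_norms, probs):
        entropy = -sum(q * math.log(q) for q in p if q > 0.0)
        scores.append(g + alpha_h * entropy + alpha_c * (1.0 - max(p)))
    return scores


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def expand_to_lines(positions, line_of_token):
    """Grow selected positions to every token on the same source line.

    line_of_token[i] is the (assumed precomputed) line index of token i.
    """
    lines = {line_of_token[i] for i in positions}
    return sorted(i for i, ln in enumerate(line_of_token) if ln in lines)
```

Expanding to whole lines is the simplest coherent neighborhood; an expression- or block-level rule would need a parser and is left out of this sketch.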
### C.3 Mode A: KL-Anchored Augmented-Lagrangian Projection
The first correction mode shifts the clean-state proposal toward satisfying the surrogate constraints while preserving proximity to the base denoiser. Given the localized edit set $\mathcal{S}_t$, GradGuide introduces an auxiliary optimization variable $\bm{y} \in \Delta^{L \times |\mathcal{V}|}$, initialized at the clean-state proposal $\hat{\bm{x}}_0^{(t)}$, and approximately solves

$$\bm{y}_t \approx \arg\min_{\bm{y} \in \Delta^{L \times |\mathcal{V}|}} \mathcal{L}_{\mathrm{ALM}}\bigl(\bm{y}; \hat{\bm{x}}_0^{(t)}, \mathcal{S}_t, c\bigr), \tag{26}$$

where

$$\mathcal{L}_{\mathrm{ALM}}\bigl(\bm{y}; \hat{\bm{x}}_0^{(t)}, \mathcal{S}_t, c\bigr) = D_{\mathrm{KL}}\left(\bm{y} \,\|\, \hat{\bm{x}}_0^{(t)}\right) + \sum_{j=1}^{m} \left[\lambda_j \Delta g_j(\bm{y}, c) + \frac{\mu_j}{2} \Delta g_j(\bm{y}, c)^{2}\right] + \beta \sum_{i \notin \mathcal{S}_t} D_{\mathrm{KL}}\left(\bm{y}^{i} \,\|\, \hat{x}_0^{(t),i}\right). \tag{27}$$

The first term is a trust region that anchors the corrected proposal to the base denoiser distribution. The second term is an augmented-Lagrangian penalty that drives surrogate violation toward zero. The final term is a locality anchor that discourages changes outside the localized edit region. In the limit of large $\beta$, the update is effectively restricted to $\mathcal{S}_t$.
In implementation, we parameterize $\bm{y} = \mathrm{softmax}(u)$ with unconstrained logits $u \in \mathbb{R}^{L \times |\mathcal{V}|}$, initialize

$$u^{(0)} = \log\left(\hat{\bm{x}}_0^{(t)} + \varepsilon\right), \tag{28}$$

and take $K_{\mathrm{inner}}$ first-order steps on Eq. [27](https://arxiv.org/html/2605.16829#A3.E27). The corrected proposal after these steps is denoted $\bm{y}_t$.
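The objective in Eq. (27) and the logit initialization in Eq. (28) can be evaluated with plain Python as a sanity check. This sketch only computes the loss value; the violations $\Delta g_j(\bm{y}, c)$ are assumed to be supplied by a surrogate that is not shown, and the gradient steps themselves would need an autodiff framework. All names are illustrative.

```python
import math


def softmax(u):
    """Numerically stable softmax over one row of logits."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [v / total for v in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def alm_loss(y, x_hat, edit_set, violations, lam, mu, beta):
    """Trust-region KL + augmented-Lagrangian penalty + locality anchor.

    y, x_hat: L rows of token distributions; edit_set: positions in S_t;
    violations: precomputed Delta g_j(y, c); lam, mu: per-constraint weights.
    """
    trust = sum(kl(yi, xi) for yi, xi in zip(y, x_hat))
    penalty = sum(l * dg + 0.5 * m * dg * dg for l, m, dg in zip(lam, mu, violations))
    locality = beta * sum(kl(y[i], x_hat[i]) for i in range(len(y)) if i not in edit_set)
    return trust + penalty + locality


def init_logits(x_hat, eps=1e-8):
    """u^(0) = log(x_hat + eps), so that softmax(u^(0)) is close to x_hat."""
    return [[math.log(p + eps) for p in row] for row in x_hat]
```

At initialization both KL terms vanish, so the loss equals the penalty term alone, and the first inner steps are driven entirely by the surrogate violation.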
The multipliers and penalties are updated using an augmented-Lagrangian rule:

$$\lambda_j \leftarrow \bigl[\lambda_j + \mu_j \Delta g_j(\bar{\bm{x}}_0^{(t)}, c)\bigr]_{+}, \tag{29}$$

and

$$\mu_j \leftarrow \begin{cases} \rho \mu_j, & \Delta g_j(\bar{\bm{x}}_0^{(t)}, c) \geq \vartheta \Delta g_j^{\mathrm{prev}}, \\ \mu_j, & \text{otherwise}, \end{cases} \tag{30}$$

where $\rho > 1$ is the penalty growth factor and $\vartheta \in (0,1)$ is a progress tolerance. The update uses the surrogate violation on the argmax-decoded intermediate program $\bar{\bm{x}}_0^{(t)}$, rather than only the soft proposal $\bm{y}$, to reduce drift between soft surrogate satisfaction and decoded-program feasibility.
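Eqs. (29) and (30) translate into a two-line update per constraint. In this sketch the default values of `rho` and `theta` are placeholders chosen for illustration, not values reported in the paper, and the multiplier update is assumed to use the penalty weight from before its own growth step.

```python
def alm_update(lam, mu, violation, prev_violation, rho=2.0, theta=0.5):
    """One multiplier/penalty update for a single constraint.

    violation: Delta g_j on the argmax-decoded program at this step.
    prev_violation: the same quantity from the previous update.
    rho > 1 grows the penalty; theta in (0, 1) is the progress tolerance.
    """
    new_lam = max(0.0, lam + mu * violation)
    # Grow the penalty only when the violation failed to shrink by a
    # factor of theta relative to the previous update.
    new_mu = rho * mu if violation >= theta * prev_violation else mu
    return new_lam, new_mu
```

The penalty therefore stays fixed while the decoded program keeps improving and escalates only when progress stalls.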
Mode A is gated by the surrogate score. If the decoded intermediate program already satisfies the surrogate threshold,

$$g_{\phi}(\bar{\bm{x}}_0^{(t)}, c) \geq \tau_{\mathrm{alm}}, \tag{31}$$

then GradGuide skips the inner optimization and returns $\bm{y}_t = \hat{\bm{x}}_0^{(t)}$. Otherwise, the KL-anchored correction is applied.
### C.4 Mode B: Constraint-Triggered Local Remasking
The second correction mode reopens already committed tokens for re-denoising. This is necessary because, under the masked diffusion reverse kernel, once $x_t^{i} \neq \texttt{[MASK]}$, a proposal correction at position $i$ cannot directly rewrite the committed token. Mode B therefore remasks localized positions when the intermediate program is sufficiently decoded and still violates the target constraint.
Let

$$n_{\mathrm{mask}}(\bm{x}_t) = \bigl|\{i \in [L] : x_t^{i} = \texttt{[MASK]}\}\bigr| \tag{32}$$

denote the number of masked positions at timestep $t$, and let $b_t$ denote the number of edits already used along the trajectory. Mode B activates when

$$n_{\mathrm{mask}}(\bm{x}_t) \leq m_{\star}, \qquad b_t < B, \qquad \mathrm{Satisfies}(\bar{\bm{x}}_0^{(t)}, c) = \mathtt{false}. \tag{33}$$

Here, $m_{\star}$ is a mask-count threshold ensuring that the partial program is sufficiently decoded for evaluation, and $B$ is the global edit budget. The predicate $\mathrm{Satisfies}$ can be an exact oracle when available, such as compilation, unit-test execution, or static analysis. When exact evaluation is not available on the partial program, GradGuide uses the surrogate condition $g_{\phi}(\bar{\bm{x}}_0^{(t)}, c) \geq \tau$.
When Eq. [33](https://arxiv.org/html/2605.16829#A3.E33) is satisfied, GradGuide constructs a reopened state

$$x_t^{\star,i} = \begin{cases} \texttt{[MASK]}, & i \in \mathcal{S}_t, \\ x_t^{i}, & i \notin \mathcal{S}_t. \end{cases} \tag{34}$$

The reverse chain then continues from $\bm{x}_t^{\star}$ under the constrained reverse kernel, with Mode A active during the refill steps. The edit counter $b_t$ is incremented after the intervention.
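The trigger in Eq. (33) and the reopening rule in Eq. (34) can be sketched on a token list. The constraint verdict is passed in as a boolean because the oracle or surrogate that produces it is outside this example; the function name `mode_b_step` and the string mask token are assumptions.

```python
MASK = "[MASK]"


def mode_b_step(x_t, edit_set, satisfied, edits_used, mask_threshold, budget):
    """Remask the localized positions when the Mode B trigger fires.

    x_t: current token list; edit_set: positions in S_t;
    satisfied: verdict of Satisfies on the decoded program;
    edits_used: b_t; mask_threshold: m_star; budget: B.
    Returns the (possibly reopened) state and the updated edit counter.
    """
    n_mask = sum(1 for tok in x_t if tok == MASK)
    if n_mask <= mask_threshold and edits_used < budget and not satisfied:
        reopened = [MASK if i in edit_set else tok for i, tok in enumerate(x_t)]
        return reopened, edits_used + 1
    return list(x_t), edits_used
```

When the trigger does not fire, the state is returned unchanged, which matches the convention $\bm{x}_t^{\star} = \bm{x}_t$ used when Mode B is inactive.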
### C.5 Composition of the Two Modes
GradGuide uses a single surrogate signal to produce two coupled outputs at each reverse step: a corrected proposal $\bm{y}_t$ from Mode A and, when triggered, a reopened state $\bm{x}_t^{\star}$ from Mode B. The constrained reverse kernel then advances the chain using the pair $(\bm{x}_t^{\star}, \bm{y}_t)$. If Mode B does not activate, $\bm{x}_t^{\star} = \bm{x}_t$.
The two modes address complementary failure cases. Mode A shifts probability mass toward feasible tokens while the position remains editable through the proposal distribution. Mode B reopens tokens that have already been committed, allowing the denoiser to repair localized regions with the benefit of the surrounding program context. Both modes are scoped by the same localization set $\mathcal{S}_t$ and driven by the same surrogate gradient.
The main configurations are recovered by different choices of the projection threshold and edit budget:

$$\begin{aligned}
\tau_{\mathrm{alm}} = 0,\; B = 0 &\quad \text{recovers the unconstrained reverse process},\\
\tau_{\mathrm{alm}} > 0,\; B = 0 &\quad \text{uses only KL-anchored proposal correction},\\
\tau_{\mathrm{alm}} = 0,\; B > 0 &\quad \text{uses only constraint-triggered remasking},\\
\tau_{\mathrm{alm}} > 0,\; B > 0 &\quad \text{uses the full GradGuide operator}.
\end{aligned}$$

The deployed configuration uses both modes. Its per-step cost is at most one surrogate forward/backward pass, $K_{\mathrm{inner}}$ inner optimization steps when Mode A is active, and $K_{\mathrm{edit}}$ additional reverse steps when Mode B activates.
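The four configurations reduce to two independent switches. A tiny sketch of that mapping, with `gradguide_configuration` as a hypothetical helper name:

```python
def gradguide_configuration(tau_alm, budget):
    """Name the operator variant selected by (tau_alm, B).

    tau_alm > 0 enables Mode A (with tau_alm = 0 the gate in Eq. 31 always
    skips the inner optimization); budget > 0 enables Mode B.
    """
    mode_a = tau_alm > 0
    mode_b = budget > 0
    if mode_a and mode_b:
        return "full GradGuide operator"
    if mode_a:
        return "KL-anchored proposal correction only"
    if mode_b:
        return "constraint-triggered remasking only"
    return "unconstrained reverse process"
```

This is the same switchboard that the component ablation relies on: each ablated variant corresponds to turning one of the two switches off.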
## Appendix D Details of MDFI
This appendix provides the implementation details of the static-analysis-guided operator used in Section [5.2](https://arxiv.org/html/2605.16829#S5.SS2). MDFI instantiates the constraint-aware operator $\mathcal{P}_{\mathcal{C}}$ for security constraints. Its goal is to provide non-differentiable denoising-time feedback for properties that are characterized by discrete syntactic and dataflow patterns rather than smooth correctness scores. The base diffusion parameters $\theta$ are not modified, the analyzer carries no learnable parameters, and the same partial-program analysis pipeline is reused across all benchmarks, languages, and decoding configurations: MDFI is therefore training-free.
### D.1 Partial Program Representation
At each fired checkpoint, the analyzer builds a structural representation $\mathcal{G}_t$ from the decoded clean-state proposal $\bar{\bm{x}}_0^{(t)}$ defined in Eq. [4](https://arxiv.org/html/2605.16829#S5.E4). The representation is a partial program graph that combines an abstract syntax tree with control-flow and dataflow approximations: $\mathcal{G}_t = (V,\, E^{\mathrm{AST}},\, E^{\mathrm{CFG}},\, E^{\mathrm{DFG}},\, \ell)$, where $V$ is the set of program nodes (statements, expressions, identifiers), $E^{\mathrm{AST}}$ encodes parent–child structure, $E^{\mathrm{CFG}}$ encodes control flow between adjacent basic blocks within each function, $E^{\mathrm{DFG}}$ encodes intra-procedural dataflow between defining and using occurrences of each program identifier, and $\ell$ labels each node with its lexical class.
Tokens in $\bm{x}_t$ that remain masked at checkpoint time are not carried into the parser literally; instead, each mask is rewritten into a fresh placeholder identifier of the form $\mathtt{\_\_hole\_<i>\_\_}$ before parsing. This rewrite preserves the lexical class of the position (an identifier slot remains an identifier, a literal slot remains a literal) and lets the partial parser produce a well-formed graph even when a non-trivial fraction of the program is still masked. The analyzer is invoked at a sparse schedule of checkpoints (Section [D.7](https://arxiv.org/html/2605.16829#A4.SS7)), gated by a minimum committed-fraction so that the rewrite is only triggered when enough of $\bar{\bm{x}}_0^{(t)}$ has been decoded for the resulting graph to be informative.
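The mask-to-placeholder rewrite can be illustrated with a minimal Python sketch. It assumes a token list with a literal `[MASK]` string and treats every masked slot as an identifier slot; the paper's analyzer also preserves literal slots, which this sketch does not model. The helper name `rewrite_masks` and the position map it returns are hypothetical.

```python
from __future__ import annotations

import ast

MASK = "[MASK]"  # assumed textual form of the mask token


def rewrite_masks(tokens: list[str]) -> tuple[list[str], dict[str, int]]:
    """Replace each mask token with a fresh placeholder identifier.

    Returns the rewritten tokens and a map from placeholder name back to
    the original token position, so analyzer findings can be localized.
    """
    rewritten: list[str] = []
    hole_to_pos: dict[str, int] = {}
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            name = f"__hole_{len(hole_to_pos)}__"
            hole_to_pos[name] = pos
            rewritten.append(name)
        else:
            rewritten.append(tok)
    return rewritten, hole_to_pos


tokens = ["result", "=", MASK, "(", MASK, ")"]
rewritten, hole_to_pos = rewrite_masks(tokens)
# The rewritten text parses because every hole is a valid identifier.
tree = ast.parse(" ".join(rewritten))
```

Keeping the placeholder-to-position map is one way to project graph nodes back to token indices later; the paper does not specify how its implementation tracks this correspondence.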
### D.2 Vulnerability Detection on Partial Programs
The detector inspects $\mathcal{G}_t$ for two complementary symptoms of insecurity that together cover both fully decoded and partially masked vulnerable shapes.
Dataflow witnesses. For each known unsafe sink class, the detector performs a bounded breadth-first search on $E^{\mathrm{DFG}}$ from candidate *source* nodes (locations in $\mathcal{G}_t$ that introduce data from outside the program's trust boundary, such as user input or environment data) toward candidate *sink* nodes (locations that consume data in a security-sensitive way, such as command execution or unparameterized query construction). The search halts whenever it crosses a sanitizer node and yields a witness whenever it terminates at a sink without ever crossing one. The result is a path of nodes in $\mathcal{G}_t$ that demonstrates the vulnerable flow.
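The bounded source-to-sink search can be sketched as follows. This is an illustrative Python version under simplifying assumptions: the dataflow graph is an adjacency dictionary over string node names, the depth bound `max_depth` counts edges, and sanitizer nodes are simply never expanded. The function name and parameters are hypothetical, not the paper's API.

```python
from collections import deque


def find_dataflow_witness(dfg, sources, sinks, sanitizers, max_depth=8):
    """Bounded BFS over dataflow edges from source nodes toward sink nodes.

    dfg maps a node to the nodes its value flows into. Paths are never
    extended through sanitizer nodes. Returns the first unsanitized
    source-to-sink path found, or None if no such path exists in bounds.
    """
    queue = deque([src] for src in sources if src not in sanitizers)
    visited = set(sources)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path  # witness: tainted data reaches a sink unsanitized
        if len(path) - 1 >= max_depth:
            continue  # depth bound reached; do not expand further
        for nxt in dfg.get(node, ()):
            if nxt in visited or nxt in sanitizers:
                continue
            visited.add(nxt)
            queue.append(path + [nxt])
    return None


dfg = {"user_input": ["cmd"], "cmd": ["os.system"]}
witness = find_dataflow_witness(dfg, ["user_input"], {"os.system"}, set())
```

Because sanitizer nodes are skipped rather than marked visited along a path, a sink that is reachable through both a sanitized and an unsanitized route is still reported via the unsanitized one.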
Structural witnesses. For partial programs in which the dangerous identifier is still masked or the dataflow chain has not yet crystallized, the detector also matches local AST shapes that are unsafe regardless of dataflow: a sink callsite whose argument shape bypasses a required guard, an insecure construction template, or a missing structural neighbor that the safe usage requires. These shape matches fire on subgraphs of $\mathcal{G}_t$ rather than on dataflow paths and are essential for catching vulnerabilities at early checkpoints when only the coarse program skeleton is committed.
In both cases, each detection produces a witness $w_k = (n_k, \tau_k, h_k)$ with the offending node $n_k \in V$, the correction type $\tau_k \in \{\mathrm{sub}, \mathrm{ins}\}$, and a structured remediation hint $h_k$ identifying the violated property and the recommended safe pattern. The witnesses across the two detection paths are combined into a single set $\mathcal{W}_t = \{w_k\}_{k=1}^{K_t}$ that populates the violation vector $\boldsymbol{\nu}_t$ and structured feedback $r_t$ in Eq. [5](https://arxiv.org/html/2605.16829#S5.E5).
### D.3 Localization on the Program Graph
For each witness $w_k$, the localization map $\mathcal{M}^{\mathrm{MDFI}}$ lifts the offending node $n_k$ to a coherent token region by walking $\mathcal{G}_t$ along its surrounding AST and dataflow neighborhood. The neighborhood includes the smallest enclosing statement node, the immediately neighboring AST nodes around $n_k$, and the dataflow-adjacent nodes that define identifiers used by $n_k$ within a small dataflow radius. The set of structurally adjacent nodes $\mathcal{N}(n_k; \mathcal{G}_t)$ is then projected back to token positions:
$$
N(w_k) \;=\; \mathrm{Tok}\!\bigl(\mathcal{N}(n_k; \mathcal{G}_t)\bigr) \;\subseteq\; [L]. \tag{35}
$$
The full editable region is the union over witnesses, capped by a token budget $B$ that bounds the total fraction of the program that may be revised at any single checkpoint:
$$
\mathcal{S}_t \;=\; \mathrm{TopBudget}_B\!\Bigl(\bigcup_{k=1}^{K_t} N(w_k)\Bigr) \;\subseteq\; [L]. \tag{36}
$$
When $\bigl|\bigcup_k N(w_k)\bigr| > B$, $\mathrm{TopBudget}_B$ retains the witnesses with the highest analyzer confidence first; ties are broken at the granularity of AST statements so that the resulting localization preserves syntactic coherence.
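One plausible reading of the budgeted union in Eq. 36 is a greedy selection over whole witness neighborhoods. The Python sketch below assumes each witness arrives as a `(confidence, positions)` pair and that a neighborhood is either admitted whole or skipped; the statement-level tie-breaking described above is not modeled, and the name `top_budget` is hypothetical.

```python
def top_budget(witnesses, budget):
    """Greedy budgeted union of witness neighborhoods.

    witnesses: list of (confidence, positions) pairs, where positions is an
    iterable of token indices. Neighborhoods are considered in descending
    confidence and admitted only if the union stays within the budget.
    Returns the selected token positions in sorted order.
    """
    selected = set()
    # sorted() is stable, so equal-confidence witnesses keep input order.
    for _, positions in sorted(witnesses, key=lambda w: -w[0]):
        candidate = selected | set(positions)
        if len(candidate) <= budget:
            selected = candidate
    return sorted(selected)


region = top_budget([(0.9, {3, 4, 5}), (0.6, {5, 6, 7, 8}), (0.8, {10})], budget=5)
```

Admitting neighborhoods whole keeps each reopened region syntactically contiguous, at the cost of occasionally leaving some of the budget unused.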
### D.4 Substitute Remasking
For witnesses with $\tau_k = \mathrm{sub}$, the offending construct is already present in the partial program in the wrong form, for instance an unsafe API call, an unsafe constructor, or a hard-coded credential. The operator opens these positions for re-denoising by setting them back to the mask token. Let
$$
\mathcal{S}_t^{\mathrm{sub}} \;=\; \mathcal{S}_t \;\cap \bigcup_{k:\,\tau_k = \mathrm{sub}} N(w_k). \tag{37}
$$
The substitute operation produces
$$
x_t^{\star,i} \;=\; \begin{cases} \texttt{[MASK]}, & i \in \mathcal{S}_t^{\mathrm{sub}}, \\ x_t^{i}, & i \notin \mathcal{S}_t^{\mathrm{sub}}. \end{cases} \tag{38}
$$
This is the same form as the standard partial-mask state defined in Section [5.1](https://arxiv.org/html/2605.16829#S5.SS1), with the localization set anchored on a witness node rather than on a heuristic score.
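Eqs. 37 and 38 translate almost directly into code. A minimal Python sketch, assuming tokens are strings, the localization set is a set of indices, and the substitute-type neighborhoods are passed as a list of index sets; the function name `substitute_remask` is hypothetical:

```python
def substitute_remask(tokens, region, sub_neighborhoods, mask="[MASK]"):
    """Re-mask positions that are both localized and flagged for substitution.

    region: localized token positions (the set S_t).
    sub_neighborhoods: one set of positions N(w_k) per 'sub' witness.
    """
    # Eq. 37: intersect S_t with the union of substitute neighborhoods.
    sub_region = set(region) & set().union(*sub_neighborhoods)
    # Eq. 38: mask inside the region, leave every other token untouched.
    return [mask if i in sub_region else tok for i, tok in enumerate(tokens)]


tokens = ["os", ".", "system", "(", "cmd", ")"]
remasked = substitute_remask(tokens, region={2, 4}, sub_neighborhoods=[{2, 3}])
```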
### D.5 Mask Insertion
For witnesses with $\tau_k = \mathrm{ins}$, the bug is the *absence* of a needed construct, for example a missing input-validation guard, a missing length check, or a missing exception handler. Re-denoising existing tokens cannot resolve these failures because the tokens that should be there do not yet exist in $\bm{x}_t$. The operator instead allocates $K$ fresh mask positions adjacent to a structurally chosen anchor in $N(w_k)$ and splices them into the partial-mask state:
$$
\bm{x}_t^{\star} \;\leftarrow\; \mathrm{Insert}_K\!\bigl(\bm{x}_t^{\star},\; \mathrm{anchor}(w_k)\bigr), \qquad k : \tau_k = \mathrm{ins}. \tag{39}
$$
The anchor is chosen so that the inserted region admits a syntactically coherent infill: typically immediately before the offending sink node, immediately after a relevant assignment, or at the start of the offending block. The insertion extends the sequence length by $K$ and right-shifts the downstream committed tokens to preserve syntactic context. The localization set $\mathcal{S}_t^{\mathrm{sub}}$ used by Eq. [38](https://arxiv.org/html/2605.16829#A4.E38) is updated to reference the post-insertion sequence whenever both substitute and insert witnesses are present in $\mathcal{W}_t$.
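The splice and the accompanying index update can be sketched in a few lines of Python. The sketch assumes the anchor is given as a token index and that masks are inserted immediately before it; how the anchor is actually chosen from the program graph is outside its scope, and the name `insert_masks` is hypothetical.

```python
def insert_masks(tokens, anchor, k, sub_region=frozenset(), mask="[MASK]"):
    """Insert k mask tokens before index `anchor` and re-index sub_region.

    Returns the longer token list, the substitute region expressed in
    post-insertion indices, and the set of freshly inserted positions.
    """
    new_tokens = tokens[:anchor] + [mask] * k + tokens[anchor:]
    # Positions at or after the anchor shift right by k; earlier ones stay.
    shifted = {i + k if i >= anchor else i for i in sub_region}
    inserted = set(range(anchor, anchor + k))
    return new_tokens, shifted, inserted


tokens = ["run", "(", "cmd", ")"]
new_tokens, shifted, inserted = insert_masks(tokens, anchor=0, k=2, sub_region={2})
```

Returning the shifted substitute region alongside the new sequence mirrors the requirement that Eq. 38 reference the post-insertion sequence when both witness types are present.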
### D.6 Pre-Allocated Prompt Buffer
To prevent the next denoiser pass from regenerating the same vulnerable construct under unchanged conditioning, MDFI augments the conditioning context $c$ with the analyzer's remediation hints. At the start of every trajectory we pre-allocate a contiguous mask buffer of fixed length $B_p$ inside the prompt context, immediately following the task description. This buffer is part of the model's input but is flagged as conditioning rather than as part of the output sequence; the constrained reverse kernel never samples into it, and base-model attention over fixed prompt positions is unaffected by it.
At each fired checkpoint, the structured hints $\{h_k\}$ in $r_t$ are tokenized into a natural-language remediation message $\rho_t(r_t)$ that names the violated property and the recommended safe pattern, and $\rho_t(r_t)$ is written into the buffer slots, displacing leading mask tokens. When multiple checkpoints fire along a trajectory, the most recent message overwrites any previously injected message in the same buffer:
$$
c^{\star} \;=\; \mathrm{Write}\!\bigl(c,\; \mathrm{Buffer}_{B_p},\; \rho_t(r_t)\bigr). \tag{40}
$$
Because $|c^{\star}| = |c|$ by construction, no token positions inside the program $\bm{x}_t^{\star}$ are displaced by feedback injection. When the message is shorter than $B_p$, trailing buffer slots remain mask tokens and the model can still infer “no further hint” from that region.
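A length-preserving buffer write in the spirit of Eq. 40 can be sketched as below. The sketch assumes the context is a token list with the buffer occupying `buf_len` slots starting at `buf_start`; truncating an over-long message is an assumption of this sketch, since the paper only states that $B_p$ is sized to fit one concise message. The name `write_buffer` is hypothetical.

```python
def write_buffer(context, buf_start, buf_len, message, mask="[MASK]"):
    """Overwrite the pre-allocated buffer slots with a remediation message.

    The whole buffer is rewritten on each call, so a later message replaces
    an earlier one, and unused trailing slots are reset to the mask token.
    The returned context always has the same length as the input.
    """
    clipped = list(message)[:buf_len]  # assumption: truncate if too long
    slots = clipped + [mask] * (buf_len - len(clipped))
    return context[:buf_start] + slots + context[buf_start + buf_len:]


context = ["task", "[MASK]", "[MASK]", "[MASK]", "code"]
first = write_buffer(context, 1, 3, ["use", "shlex"])
second = write_buffer(first, 1, 3, ["quote"])  # overwrites the first hint
```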
### D.7 Checkpoint Schedule and Composition
The MDFI operator is invoked at a predetermined schedule of denoising checkpoints
$$
\mathcal{T}^{\mathrm{ck}} \;\subseteq\; \{1, \dots, T\}, \tag{41}
$$
gated by a minimum committed-fraction $\rho_{\min} \in [0, 1)$:
$$
\frac{\bigl|\{i \in [L] : x_t^{i} \neq \texttt{[MASK]}\}\bigr|}{L} \;\geq\; \rho_{\min}. \tag{42}
$$
This guarantees that the analyzer is consulted only once enough of $\bar{\bm{x}}_0^{(t)}$ has been decoded for the partial program graph $\mathcal{G}_t$ to be informative. A global intervention budget $B_{\mathrm{int}}$ caps the total number of fired checkpoints along any single trajectory; once exhausted, the operator collapses to the identity for the rest of the chain.
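The three gating conditions (scheduled step, committed fraction, remaining intervention budget) combine into a single predicate. A minimal Python sketch, assuming the schedule is a set of step indices and the caller tracks how many checkpoints have already fired; the name `checkpoint_fires` and its parameters are hypothetical:

```python
def checkpoint_fires(step, tokens, schedule, rho_min, fired_so_far,
                     max_fired, mask="[MASK]"):
    """Decide whether the analyzer should run at this denoising step.

    Fires only if the step is scheduled (Eq. 41), the intervention budget
    is not exhausted, and the committed fraction meets rho_min (Eq. 42).
    """
    if step not in schedule or fired_so_far >= max_fired or not tokens:
        return False
    committed = sum(1 for tok in tokens if tok != mask)
    return committed / len(tokens) >= rho_min


tokens = ["a", "b", "[MASK]", "[MASK]"]
fires = checkpoint_fires(step=10, tokens=tokens, schedule={10, 20},
                         rho_min=0.5, fired_so_far=0, max_fired=2)
```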
When a checkpoint at step $t$ is active, the operator combines the substitute remask of Eq. [38](https://arxiv.org/html/2605.16829#A4.E38), the mask insertion of Eq. [39](https://arxiv.org/html/2605.16829#A4.E39), and the buffer write of Eq. [40](https://arxiv.org/html/2605.16829#A4.E40) into a single output:
$$
\bigl(\bm{x}_t^{\star},\, c^{\star},\, \mathcal{S}_t\bigr) \;=\; \mathcal{P}_{\mathcal{C}}^{\mathrm{MDFI}}\!\bigl(\hat{\bm{x}}_0^{(t)},\, \mathcal{S}_t,\, r_t,\, c\bigr). \tag{43}
$$
The chain then advances from the modified state and conditioning under the constrained reverse kernel of Eq. [8](https://arxiv.org/html/2605.16829#S5.E8): $\bm{x}_{t-1} \sim p_{\theta}^{\mathcal{C}}(\,\cdot \mid \bm{x}_t^{\star}, \hat{\bm{x}}_0^{(t)}, \mathcal{S}_t;\, c^{\star})$. Crucially, MDFI does not run an inner re-denoising loop on the reopened region; it modifies the state and conditioning in place and proceeds directly to the next reverse step. As a result, the freshly masked positions are re-filled by the remaining $T-t$ standard transitions under feedback-aware conditioning, and the total number of reverse transitions along the trajectory is identical to vanilla diffusion.
The deployed configuration uses sparse late-trajectory checkpoints firing in the second half of the reverse chain, $\rho_{\min} = 0.5$, $B_{\mathrm{int}} = 2$, $K = 12$ for insertion, $B$ chosen as roughly one statement's worth of tokens, and $B_p$ chosen to accommodate a single concise remediation message. Per-step cost outside checkpoints is identical to vanilla diffusion; at a fired checkpoint, MDFI adds one analyzer invocation plus a single buffer rewrite, after which the same forward pass that the base model would have performed proceeds on the (slightly longer, when insertion fires) $\bm{x}_t^{\star}$ with augmented context $c^{\star}$.