Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

arXiv cs.CL Papers

Summary

This paper introduces FeF-DLLM, a discrete diffusion language model that eliminates factorization errors by using exact prefix-conditioned factorization and accelerates inference via speculative decoding, achieving significant improvements in accuracy and speed on benchmarks such as GSM8K and MATH.

arXiv:2605.14305v1 Announce Type: new Abstract: Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.

Cached at: 05/15/26, 06:21 AM

# Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
Source: [https://arxiv.org/html/2605.14305](https://arxiv.org/html/2605.14305)
Xun Fang (East China Normal University, 51264404020@stu.ecnu.edu.cn), Yunchen Li (East China Normal University, 52284404001@stu.ecnu.edu.cn), Hang Yuan (East China Normal University; Beijing Zhongguancun Academy, 52274404018@stu.ecnu.edu.cn), Zhou Yu (East China Normal University, zyu@stat.ecnu.edu.cn)

The authors contribute equally to the paper and are listed in alphabetical order.

###### Abstract

Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.

## 1 Introduction

Diffusion models (Ho et al., [2020](https://arxiv.org/html/2605.14305#bib.bib7); Lipman et al., [2022](https://arxiv.org/html/2605.14305#bib.bib31); Song et al., [2020b](https://arxiv.org/html/2605.14305#bib.bib32), [a](https://arxiv.org/html/2605.14305#bib.bib45)) have become one of the most successful classes of generative models in recent years, achieving strong performance in image generation (Rombach et al., [2022](https://arxiv.org/html/2605.14305#bib.bib48); Peebles and Xie, [2023](https://arxiv.org/html/2605.14305#bib.bib41); Ma et al., [2024](https://arxiv.org/html/2605.14305#bib.bib49)), video generation (Ho et al., [2022](https://arxiv.org/html/2605.14305#bib.bib38); Bar-Tal et al., [2024](https://arxiv.org/html/2605.14305#bib.bib39)), and many other domains (Wu et al., [2024](https://arxiv.org/html/2605.14305#bib.bib50); Yim et al., [2024](https://arxiv.org/html/2605.14305#bib.bib51); Wang et al., [2025](https://arxiv.org/html/2605.14305#bib.bib52)). Recently, several works have extended diffusion modeling to discrete spaces (Austin et al., [2021a](https://arxiv.org/html/2605.14305#bib.bib9); Lou et al., [2024](https://arxiv.org/html/2605.14305#bib.bib12); Gat et al., [2024](https://arxiv.org/html/2605.14305#bib.bib47); Sahoo et al., [2024](https://arxiv.org/html/2605.14305#bib.bib13)) and applied discrete diffusion language models (DLLMs) to large language modeling (Nie et al., [2025](https://arxiv.org/html/2605.14305#bib.bib15); Ye et al., [2025](https://arxiv.org/html/2605.14305#bib.bib46); Bie et al., [2025](https://arxiv.org/html/2605.14305#bib.bib43)). These models have shown competitive performance with autoregressive models while enabling a different generation paradigm based on iterative denoising and parallel token prediction.

During generation, DLLMs usually predict multiple tokens in parallel and combine these token-wise predictions to form the final output. Although this design improves generation efficiency, it introduces a factorization error because the joint distribution over clean tokens is approximated by a product of independent token distributions. Recent works have attempted to address this issue from different perspectives. ReDi (Yoo et al., [2025](https://arxiv.org/html/2605.14305#bib.bib35)) mitigates factorization error through a rectified-flow formulation and newly constructed paired training data, but it does not fully remove the error. Other methods propose improved sampling procedures (Liu et al., [2024](https://arxiv.org/html/2605.14305#bib.bib33); Lavenant and Zanella, [2025](https://arxiv.org/html/2605.14305#bib.bib37)), often at the cost of significantly slower generation. Some approaches integrate DLLMs with speculative decoding (Campbell et al., [2025](https://arxiv.org/html/2605.14305#bib.bib34); Cheng et al., [2025](https://arxiv.org/html/2605.14305#bib.bib36)); however, such designs introduce autoregressive decoding behavior into DLLM inference, thereby undermining the distinctive non-autoregressive reasoning characteristics of DLLMs.

In this paper, we propose *Factorization-Error-Free Discrete Diffusion Language Modeling* (FeF-DLLM). Built upon the classical $X_0$-prediction framework of DLLMs, FeF-DLLM analyzes the exact decomposition of the clean posterior distribution and derives a factorization-error-free generation objective. Instead of independently predicting each clean token, our method uses prefix-conditioned prediction to preserve dependencies among clean tokens. To improve inference efficiency, we further incorporate speculative decoding, which amortizes the sequential dependency introduced by prefix conditioning and accelerates generation while retaining the parallel prediction and re-masking properties of DLLMs.

We theoretically show that FeF-DLLM can generate samples from the true conditional joint distribution when the position-conditioned target model is well specified, and we analyze the expected speedup brought by speculative decoding. We evaluate FeF-DLLM on standard mathematical reasoning and code generation benchmarks, including GSM8K, MATH, HumanEval, and MBPP. Experimental results show that FeF-DLLM improves accuracy by an average of 5.04 percentage points and achieves an average inference speedup of $3.86\times$, demonstrating its effectiveness in improving generation quality while substantially accelerating inference.

Our contributions are summarized as follows:

- We analyze the factorization error in existing DLLMs and show that it can be eliminated through an exact prefix-conditioned factorization of the clean posterior.
- We use speculative decoding to accelerate the resulting prefix-conditioned inference procedure and derive its expected speedup.
- We conduct extensive experiments to evaluate FeF-DLLM. Experimental results demonstrate that FeF-DLLM consistently improves generation quality over the baseline while simultaneously achieving substantial wall-clock acceleration.

## 2 Preliminary

### 2.1 Discrete Diffusion Language Models

Discrete diffusion language models define a diffusion process directly over discrete tokens. Building on the D3PM framework (Austin et al., [2021a](https://arxiv.org/html/2605.14305#bib.bib9)), we define the forward diffusion process as a fixed Markov chain $q(X_{1:T} \mid X_0) = \prod_{t=1}^{T} q(X_t \mid X_{t-1})$, in which each transition kernel is parameterized by a categorical transition matrix $Q_t \in \mathbb{R}^{S \times S}$, where $S$ denotes the vocabulary size. For a one-hot token $X_{t-1}$, the one-step corruption process is

$$q(X_t \mid X_{t-1}) = \mathrm{Cat}(X_t;\, p = X_{t-1} Q_t),$$

where $\mathrm{Cat}(x; p)$ denotes a categorical distribution over the one-hot row vector $x$, with probabilities given by the row vector $p$. Let $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$. The marginal at timestep $t$ is $q(X_t \mid X_0) = \mathrm{Cat}(X_t;\, p = X_0 \bar{Q}_t)$, enabling noisy tokens to be sampled directly from the clean input. For sequence data, corruption is typically applied independently across token positions. Common choices of $Q_t$ include uniform transitions, nearest-neighbor transitions in embedding space, and absorbing-state transitions that gradually replace tokens with a special $[\mathrm{MASK}]$ token.
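The marginal $q(X_t \mid X_0) = \mathrm{Cat}(X_t;\, p = X_0 \bar{Q}_t)$ can be checked numerically on a toy absorbing-state kernel. The sketch below is illustrative only: the vocabulary size, the noise schedule `betas`, the convention that index 0 is $[\mathrm{MASK}]$, and all function names are assumptions, not details from the paper.

```python
MASK = 0  # index of the absorbing [MASK] state (illustrative convention)

def absorbing_matrix(vocab_size, beta):
    """Q_t for an absorbing kernel: a token is masked with probability beta."""
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    for a in range(vocab_size):
        if a == MASK:
            q[a][MASK] = 1.0          # [MASK] never leaves the absorbing state
        else:
            q[a][a] = 1.0 - beta      # token survives this step
            q[a][MASK] = beta         # token is replaced by [MASK]
    return q

def matmul(a, b):
    """Plain row-by-column product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def cumulative(qs):
    """Q-bar_t = Q_1 Q_2 ... Q_t."""
    out = qs[0]
    for q in qs[1:]:
        out = matmul(out, q)
    return out

betas = [0.1, 0.2, 0.5]                       # toy noise schedule (assumed)
q_bar = cumulative([absorbing_matrix(4, b) for b in betas])

# For a one-hot X_0 = e_2, the product X_0 Q-bar_t is simply row 2 of Q-bar_t.
marginal = q_bar[2]
survive = (1 - 0.1) * (1 - 0.2) * (1 - 0.5)   # token 2 is never masked in three steps
```

Under this kernel the marginal places mass `survive` on the original token and the remainder on $[\mathrm{MASK}]$, which is why a noisy token can be drawn in one shot from the clean input rather than by simulating every step.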

The forward posterior has the following closed form:

$$q(X_{t-1} \mid X_t, X_0) = \mathrm{Cat}\left(X_{t-1};\, p = \frac{X_t Q_t^{\top} \odot X_0 \bar{Q}_{t-1}}{X_0 \bar{Q}_t X_t^{\top}}\right).$$

The generative process is a learned reverse Markov chain $p_\theta(X_{0:T}) = p(X_T) \prod_{t=1}^{T} p_\theta(X_{t-1} \mid X_t)$. In the $X_0$-prediction parameterization, the model first predicts the clean token from the corrupted input, denoted as $\hat{X}_0 = f_\theta(X_t, t)$. Then the reverse transition is constructed by plugging this prediction into the analytic forward posterior:

$$p_\theta(X_{t-1} \mid X_t) = q(X_{t-1} \mid X_t, \hat{X}_0). \tag{1}$$

Equivalently, the model uses the predicted clean token to determine how to denoise $X_t$ by one step. The corresponding clean-token prediction objective is

$$\mathcal{L} = \mathbb{E}_{q(X_0)} \mathbb{E}_{t} \mathbb{E}_{q(X_t \mid X_0)} \left[ -\sum_{i=1}^{N} \log p_\theta(X_0^i \mid X_t, t) \right]. \tag{2}$$
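As a small worked instance of the closed-form posterior used in Eq. 1, the sketch below evaluates $q(X_{t-1} \mid X_t, \hat{X}_0)$ at $t = 2$ for a toy absorbing-state kernel. The kernel, the masking rates, and the helper names are assumptions chosen for illustration; the argument `x_0` stands in for the predicted clean token $\hat{X}_0$.

```python
MASK = 0  # index of the absorbing [MASK] state (illustrative convention)

def absorbing_matrix(vocab_size, beta):
    """Q_t for an absorbing kernel: a token is masked with probability beta."""
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    for a in range(vocab_size):
        if a == MASK:
            q[a][MASK] = 1.0
        else:
            q[a][a] = 1.0 - beta
            q[a][MASK] = beta
    return q

def forward_posterior(x_t, x_0, q_t, q_bar_prev):
    """q(X_{t-1} | X_t = x_t, X_0 = x_0) for tokens given as integer indices."""
    # (X_t Q_t^T)[j] = Q_t[j][x_t]  and  (X_0 Q-bar_{t-1})[j] = Q-bar_{t-1}[x_0][j]
    unnorm = [q_t[j][x_t] * q_bar_prev[x_0][j] for j in range(len(q_t))]
    z = sum(unnorm)                      # equals X_0 Q-bar_t X_t^T
    return [u / z for u in unnorm]

q1, q2 = absorbing_matrix(4, 0.1), absorbing_matrix(4, 0.2)   # Q-bar_1 is just Q_1

post_masked = forward_posterior(MASK, 2, q2, q1)   # X_2 = [MASK], clean prediction 2
post_visible = forward_posterior(2, 2, q2, q1)     # X_2 = 2 is already unmasked
```

With these toy numbers, a masked position is unmasked to the predicted token with probability $0.18 / 0.28$ and stays masked otherwise, while an already visible token is kept with probability one.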

### 2.2 Speculative Decoding

Speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2605.14305#bib.bib6); Chen et al., [2023](https://arxiv.org/html/2605.14305#bib.bib53)) accelerates autoregressive inference by using a smaller and faster approximation model to propose multiple candidate tokens, which are then verified in parallel by the target model. Let $M_\pi$ denote the target model, and let $\pi(X^i \mid X^{<i})$ be its next-token distribution given the prefix $X^{<i}$. We further introduce an efficient approximation model $M_\rho$, whose next-token distribution is denoted by $\rho(X^i \mid X^{<i})$. Speculative decoding proceeds as follows. First, the approximation model $M_\rho$ autoregressively generates $\gamma$ candidate tokens, denoted by $X^1, \ldots, X^\gamma$. Then, the target model $M_\pi$ evaluates the corresponding candidate prefixes in parallel and obtains the target distributions at each position. For the $i$-th proposed token $X^i$, the algorithm accepts it with probability

$$\min\left(1, \frac{\pi_i(X^i)}{\rho_i(X^i)}\right),$$

where $\pi_i$ and $\rho_i$ denote the target and approximation distributions under the corresponding prefix, respectively. If a proposed token is rejected, the algorithm resamples from the corrected distribution

$$\pi'(X) = \operatorname{norm}\left(\max\left(0, \pi(X) - \rho(X)\right)\right).$$

This correction ensures that the resulting sample is still exactly distributed according to the target distribution $\pi$. Therefore, speculative decoding produces samples from the same distribution as direct decoding from $M_\pi$. Benefiting from the parallel verification of multiple draft tokens, speculative decoding can substantially improve generation speed.
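The exactness claim can be verified for a single position by summing over the two ways a token can be emitted: accepted from the draft, or resampled after a rejection. The sketch below does this with exact rational arithmetic. The distributions `pi` and `rho` are made-up toy values, and the code assumes `rho` is strictly positive on every token.

```python
from fractions import Fraction as F

def accept_prob(pi, rho, x):
    """Acceptance probability min(1, pi(x) / rho(x)) for a drafted token x."""
    return min(F(1), pi[x] / rho[x])

def residual(pi, rho):
    """norm(max(0, pi - rho)): the corrected distribution used after a rejection."""
    raw = [max(F(0), p - r) for p, r in zip(pi, rho)]
    z = sum(raw)
    return [v / z for v in raw]

def output_law(pi, rho):
    """Exact distribution of the token emitted by one draft-then-verify step."""
    p_reject = sum(r * (1 - accept_prob(pi, rho, x)) for x, r in enumerate(rho))
    corrected = residual(pi, rho) if p_reject > 0 else pi
    return [rho[x] * accept_prob(pi, rho, x) + p_reject * corrected[x]
            for x in range(len(pi))]

pi = [F(1, 2), F(3, 10), F(1, 5)]    # toy target distribution
rho = [F(1, 5), F(1, 2), F(3, 10)]   # toy draft distribution
law = output_law(pi, rho)            # matches pi exactly
```

Here the draft overweights the second and third tokens, so all rejected mass is redirected to the first token, and the emitted law coincides with `pi` term by term.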

### 2.3 Setting and Notation

We consider a $d$-dimensional discrete random variable $X = (X^1, \dots, X^d)^{\top}$ defined on $\mathcal{V}^d$. Subscripts are used to index decoding steps in diffusion language model inference, while superscripts are used to index token positions in a sequence. For example, $X_t^i$ denotes the random token at position $i$ when the diffusion decoder is at step $t$. We write $X^{<i}$ and $X^{>i}$ to denote the subsequences of tokens before and after position $i$, respectively. In the context of speculative decoding, we use $\pi$ to denote the target distribution and $\rho$ to denote the draft distribution. The symbol $\odot$ denotes element-wise multiplication.

## 3 Methodology

### 3.1 Analysis of Factorization Error

We first revisit the commonly used $X_0$-prediction parameterization in discrete diffusion language models. Given a corrupted sequence $X_t$, the reverse transition in Eq. [1](https://arxiv.org/html/2605.14305#S2.E1) is constructed from an estimate of the clean-data posterior $p(X_0 \mid X_t)$. In principle, this posterior is a distribution over the full discrete sequence space $\mathcal{V}^d$, whose cardinality grows exponentially with the sequence length $d$. Directly modeling such a joint distribution is computationally intractable for neural language models. Therefore, existing DLLMs usually adopt a token-wise prediction objective, in which the model predicts each clean token independently conditioned on the same corrupted input:

$$p(X_0 \mid X_t) \approx \prod_{i=1}^{d} p(X_0^i \mid X_t). \tag{3}$$

However, Eq. [3](https://arxiv.org/html/2605.14305#S3.E3) relies on an independence assumption across dimensions. Such an assumption is generally incompatible with natural language, where tokens exhibit strong syntactic and semantic dependencies. The exact decomposition of the clean posterior follows the chain rule:

$$p(X_0 \mid X_t) = \prod_{i=1}^{d} p(X_0^i \mid X_t, X_0^{<i}).$$

Compared with Eq. [3](https://arxiv.org/html/2605.14305#S3.E3), this factorization preserves dependencies among clean tokens by conditioning each position on the previously recovered clean prefix $X_0^{<i}$. The discrepancy between the token-wise approximation and the exact factorization is therefore characterized by the missing dependence on $X_0^{<i}$. To make this discrepancy explicit, we derive the following identity.
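A two-token toy case makes the gap between Eq. 3 and the chain rule concrete. Suppose, purely for illustration, that the clean posterior over two masked positions puts half its mass on each of two city names. The joint table and helper names below are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Toy clean posterior p(X_0 | X_t) over two masked positions (made-up numbers).
joint = {("New", "York"): F(1, 2), ("Los", "Angeles"): F(1, 2)}

def marginal(table, pos):
    """Token-wise marginal p(X_0^pos | X_t)."""
    out = {}
    for seq, p in table.items():
        out[seq[pos]] = out.get(seq[pos], F(0)) + p
    return out

def conditional(table, first):
    """p(X_0^2 | X_t, X_0^1 = first)."""
    z = sum(p for seq, p in table.items() if seq[0] == first)
    return {seq[1]: p / z for seq, p in table.items() if seq[0] == first}

m0, m1 = marginal(joint, 0), marginal(joint, 1)

# Eq. (3): product of independent token-wise marginals.
independent = {(a, b): m0[a] * m1[b] for a, b in product(m0, m1)}

# Chain rule: p(x^1) * p(x^2 | x^1).
exact = {(a, b): m0[a] * pb for a in m0 for b, pb in conditional(joint, a).items()}

# Mass the independent approximation assigns to sequences the posterior rules out.
invalid_mass = sum(p for seq, p in independent.items() if seq not in joint)
```

In this example the independent product spends half of its probability on mixed pairs such as ("New", "Angeles"), whereas the chain-rule factorization reproduces the joint table exactly.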

###### Lemma 1\.

Assume that the forward corruption process factorizes across token positions, i.e., $q(X_t \mid X_0) = \prod_{j=1}^{d} q(X_t^j \mid X_0^j)$. For any position $i$, the autoregressive clean-token posterior satisfies

$$\begin{aligned} p(X_0^i \mid X_t, X_0^{<i}) &= p(X_0^i \mid X_t^{\geq i}, X_0^{<i}) \\ &\propto p(X_0^i \mid X_t)\, \frac{p(X_0^i \mid X_t^{>i}, X_0^{<i})}{p(X_0^i \mid X_t^{-i})}, \end{aligned}$$

where $X_t^{-i} = (X_t^{<i}, X_t^{>i})$, and the proportionality is over $X_0^i$.

Lemma [1](https://arxiv.org/html/2605.14305#Thmlemma1) shows that the desired autoregressive posterior $p(X_0^i \mid X_t, X_0^{<i})$ is equivalent to conditioning on $(X_t^{\geq i}, X_0^{<i})$, and differs from the independent predictor $p(X_0^i \mid X_t)$ by the normalized correction term $p(X_0^i \mid X_t^{>i}, X_0^{<i}) / p(X_0^i \mid X_t^{-i})$. This term captures the additional dependence on the clean prefix and the remaining corrupted context. Therefore, predicting the $i$-th clean token should account not only for the corrupted sequence $X_t$, but also for previously determined clean tokens; ignoring this dependence leads to factorization error in the learned reverse process.
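Lemma 1 can be sanity-checked by brute force on a tiny model. The sketch below enumerates every clean and noisy sequence for $d = 3$ and a two-token vocabulary, then compares the three expressions in the lemma using exact fractions. The prior weights, the per-position noise channel, and the observed values are arbitrary toy choices; only the position-wise factorization of the noise is taken from the lemma's assumption.

```python
from fractions import Fraction as F
from itertools import product

CLEAN, NOISY, D = (1, 2), (0, 1, 2), 3      # 0 plays the role of [MASK]

def channel(xt, x0):
    """Toy q(x_t^j | x_0^j): keep 6/10, mask 3/10, flip 1/10."""
    if xt == x0:
        return F(6, 10)
    return F(3, 10) if xt == 0 else F(1, 10)

# Arbitrary clean prior over {1,2}^3 that does not factorize across positions.
weights = [1, 4, 6, 7, 7, 6, 4, 1]
prior = {x0: F(w, sum(weights)) for x0, w in zip(product(CLEAN, repeat=D), weights)}

def joint(x0, xt):
    """p(x0) times the position-wise factorized corruption probability."""
    p = prior[x0]
    for a, b in zip(xt, x0):
        p *= channel(a, b)
    return p

def posterior(i, cond):
    """p(X_0^i = v | event), where cond(x0, xt) -> bool selects the event."""
    mass = {v: F(0) for v in CLEAN}
    for x0 in product(CLEAN, repeat=D):
        for xt in product(NOISY, repeat=D):
            if cond(x0, xt):
                mass[x0[i]] += joint(x0, xt)
    total = sum(mass.values())
    return {v: m / total for v, m in mass.items()}

i, xt_obs, prefix = 1, (2, 1, 0), (1,)      # 0-based position, noisy sequence, clean prefix

full = posterior(i, lambda x0, xt: xt == xt_obs and x0[:i] == prefix)
suffix = posterior(i, lambda x0, xt: xt[i:] == xt_obs[i:] and x0[:i] == prefix)
indep = posterior(i, lambda x0, xt: xt == xt_obs)
num = posterior(i, lambda x0, xt: xt[i + 1:] == xt_obs[i + 1:] and x0[:i] == prefix)
den = posterior(i, lambda x0, xt: xt[:i] + xt[i + 1:] == xt_obs[:i] + xt_obs[i + 1:])

raw = {v: indep[v] * num[v] / den[v] for v in CLEAN}
corrected = {v: r / sum(raw.values()) for v, r in raw.items()}
```

Here `full`, `suffix`, and `corrected` correspond to the left-hand side, the first equality, and the normalized right-hand side of the lemma, while `indep` is the token-wise predictor that ignores the clean prefix.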

Motivated by this observation, we replace the independent clean-token predictor with a prefix-conditioned predictor. Instead of estimating $p_\theta(X_0^i \mid X_t, t)$ independently for each position, we train the model to approximate

$$p_\theta(X_0^i \mid X_t^{\geq i}, X_0^{<i}, t).$$

This parameterization preserves the left-to-right dependency structure of the clean sequence while still leveraging the bidirectional corrupted context available in diffusion inference. The resulting training objective is

$$\mathcal{L} = \mathbb{E}_{q(X_0)} \mathbb{E}_{t} \mathbb{E}_{q(X_t \mid X_0)} \left[ -\sum_{i=1}^{d} \log p_\theta(X_0^i \mid X_t^{\geq i}, X_0^{<i}, t) \right]. \tag{4}$$

This objective has the same supervised clean-token prediction form as Eq. [2](https://arxiv.org/html/2605.14305#S2.E2). Consequently, it can be optimized by finetuning existing DLLM backbones with modified input, without changing the underlying discrete diffusion forward process. The corresponding training procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.14305#alg1) in Appendix [C](https://arxiv.org/html/2605.14305#A3).
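To make the "modified input" in Eq. 4 concrete, the sketch below builds the conditioning sequence $(X_0^{<i}, X_t^{\geq i})$ for each position and accumulates the bracketed negative log-likelihood. This is a naive illustration, not the paper's Algorithm 1: the `model` callable, its signature, and the one-call-per-position loop are assumptions, and a practical implementation would presumably batch these evaluations.

```python
import math

def prefix_conditioned_input(x0, xt, i):
    """Clean prefix X_0^{<i} followed by the corrupted suffix X_t^{>=i} (0-based i)."""
    return list(x0[:i]) + list(xt[i:])

def prefix_conditioned_nll(model, x0, xt, t):
    """Bracketed term of Eq. (4) for a single (x0, xt, t) sample."""
    nll = 0.0
    for i, target in enumerate(x0):
        probs = model(prefix_conditioned_input(x0, xt, i), i, t)
        nll -= math.log(probs[target])
    return nll

VOCAB = [1, 2, 3, 4, 5]
MASK = 0  # illustrative [MASK] index

def uniform_model(tokens, position, t):
    """Hypothetical stand-in for p_theta: ignores its input, returns a uniform law."""
    return {v: 1.0 / len(VOCAB) for v in VOCAB}

x0 = [3, 1, 4]                 # clean sequence
xt = [MASK, 1, MASK]           # corrupted sequence at step t
inputs = [prefix_conditioned_input(x0, xt, i) for i in range(len(x0))]
loss = prefix_conditioned_nll(uniform_model, x0, xt, t=1)
```

Each entry of `inputs` shows what the predictor sees for one position: positions to the left are clean, while the current position and everything to its right remain corrupted.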

### 3.2 Acceleration via Speculative Decoding

The prefix-conditioned predictor introduced in Eq. [4](https://arxiv.org/html/2605.14305#S3.E4) enables inference from the correct joint distribution of clean tokens, thereby eliminating the factorization error induced by the independence assumption, but it also changes the inference pattern of DLLMs. In standard DLLM inference, all clean-token predictions can be produced in parallel from the corrupted sequence $X_t$. In contrast, $p_\theta(X_0^i \mid X_t^{\geq i}, \hat{X}_0^{<i}, t)$ depends on the previously recovered clean prefix $\hat{X}_0^{<i}$. Therefore, directly sampling from this distribution requires left-to-right decoding within each denoising step, which may reduce the parallel efficiency of diffusion-based generation.

To mitigate this sequential bottleneck, we adopt speculative decoding within each diffusion denoising step. The key idea is to use a fast draft model to propose multiple clean tokens in parallel, and then verify these proposals with the prefix-conditioned target model. The complete inference algorithm is provided in Algorithm [2](https://arxiv.org/html/2605.14305#alg2) in Appendix [C](https://arxiv.org/html/2605.14305#A3). Suppose that the first $m$ positions have already been determined, denoted by $\hat{X}_0^{<m}$. We consider a speculative window of length $k$, covering positions $m+1, \ldots, m+k$. The draft model defines a distribution

$$\rho_\phi^i(X_0^i) := \rho_\phi(X_0^i \mid \hat{X}_0^{<m}, X_t^{>m}, t), \qquad i = m+1, \ldots, m+k.$$

Unlike the target model, the draft model conditions only on the already verified prefix $\hat{X}_0^{<m}$ and the remaining corrupted context $X_t^{>m}$. Thus, the draft distributions for all positions in the speculative window can be computed in parallel. We then sample draft tokens

$$\tilde{X}_0^{m+1}, \ldots, \tilde{X}_0^{m+k} \sim \prod_{i=m+1}^{m+k} \rho_\phi^i(\cdot).$$

After the draft tokens are generated, the target model verifies them from left to right. For each position $i = m+1, \ldots, m+k$, the target distribution is evaluated using the prefix formed by the already accepted tokens and the preceding draft tokens:

$$\pi_\theta^i(X_0^i) := p_\theta(X_0^i \mid X_t^{\geq i}, \tilde{X}_0^{<i}, t).$$

The proposed token $\tilde{X}_0^i$ is accepted with probability

$$\min\left(1, \frac{\pi_\theta^i(\tilde{X}_0^i)}{\rho_\phi^i(\tilde{X}_0^i)}\right).$$

If $\tilde{X}_0^i$ is accepted, we set $\hat{X}_0^i = \tilde{X}_0^i$ and continue to verify the next position. If it is rejected, we resample $\hat{X}_0^i$ from the corrected distribution

$$\pi_\theta^{\prime i}(x) = \operatorname{norm}\left(\max\left(0, \pi_\theta^i(x) - \rho_\phi^i(x)\right)\right),$$

and terminate the current speculative window. The next speculative step then starts from the updated verified prefix. This procedure is repeated until all positions in the block have been inferred, yielding a complete prediction $\hat{X}_0$. Once $\hat{X}_0$ is obtained, we construct the reverse transition following the standard $X_0$-prediction parameterization:

$$p_\theta(X_{t-1} \mid X_t) = q(X_{t-1} \mid X_t, \hat{X}_0),$$

as in Eq. [1](https://arxiv.org/html/2605.14305#S2.E1). Therefore, our method modifies only the clean-token prediction stage, while leaving the discrete forward process and the analytic posterior transition unchanged.
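The draft-then-verify round described above can be sketched in a few lines. This is a simplified illustration, not the paper's Algorithm 2: the `draft` and `target` callables, their signatures, and the toy distributions are assumptions, and the target is queried position by position here for readability rather than in a single parallel pass.

```python
import random

def sample(dist, rng):
    """Draw a key from a dict mapping token -> probability."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point round-off

def speculative_window(prefix, draft, target, k, rng):
    """One speculative round; returns the tokens committed after `prefix`.

    draft(prefix, offset) -> dict: rho^i, conditioned on the verified prefix only
    target(tokens)        -> dict: pi^i, conditioned on accepted plus drafted tokens
    """
    rhos = [draft(prefix, j) for j in range(k)]       # independent, hence parallelizable
    proposal = [sample(rho, rng) for rho in rhos]
    committed = []
    for rho, tok in zip(rhos, proposal):
        pi = target(prefix + committed)
        if rng.random() < min(1.0, pi.get(tok, 0.0) / rho[tok]):
            committed.append(tok)                     # accept the draft token
            continue
        resid = {x: max(0.0, p - rho.get(x, 0.0)) for x, p in pi.items()}
        z = sum(resid.values())
        committed.append(sample({x: v / z for x, v in resid.items()}, rng))
        break                                         # a rejection ends the window
    return committed

def toy_target(tokens):
    """Hypothetical pi_theta: favors repeating an 'a' that was just emitted."""
    if tokens and tokens[-1] == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.5, "b": 0.5}

def toy_draft(prefix, offset):
    """Hypothetical rho_phi: context-free, so all window positions are independent."""
    return {"a": 0.5, "b": 0.5}

out = speculative_window(["b"], toy_draft, toy_target, k=4, rng=random.Random(0))
```

Every round commits at least one token, because a rejected position is still filled from the corrected distribution, and at most `k` tokens when every draft is accepted.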

The choice of the draft model is flexible. In principle, any model that approximates the target conditional distribution and supports efficient sampling can be used as $\rho_\phi$. In practice, we use a DLLM-style draft model because its token predictions can be computed in parallel within a speculative window, resulting in an $O(1)$ drafting cost with respect to the window length. Moreover, using a draft model with the same architecture family as the target model often yields higher acceptance rates, since the draft distribution is better aligned with the prefix-conditioned target distribution. The complete inference algorithm is provided in Algorithm [2](https://arxiv.org/html/2605.14305#alg2) in Appendix [C](https://arxiv.org/html/2605.14305#A3). The overall training and inference pipeline of FeF-DLLM is illustrated in Figure [1](https://arxiv.org/html/2605.14305#S3.F1).

![Refer to caption](https://arxiv.org/html/2605.14305v1/x1.png)

Figure 1: Overview of FeF-DLLM. Prefix-conditioned training predicts each clean token from the clean prefix $X_0^{<i}$ and the remaining corrupted context $X_t^{\geq i}$. During inference, a draft model proposes a speculative window in parallel, and a prefix-conditioned target model verifies the proposals left to right before constructing the standard reverse transition.

### 3.3 Theoretical Properties

We establish two theoretical properties of FeF\-DLLM\. The first result shows that the proposed prefix\-conditioned generation rule is distributionally exact whenever the target model matches the true position\-conditioned posterior\. The second result characterizes the expected progress of speculative verification under a standard independent\-acceptance approximation\.

#### Distributional exactness\.

Fix a denoising state $X_t$ and let

$$J_t = \{j_1, \ldots, j_{L_t}\}, \qquad j_1 < \cdots < j_{L_t},$$

be the ordered set of positions to be updated at timestep $t$. Let $p^\star$ denote the oracle data law. We assume the target law $\pi_\theta$ induced by $p_\theta$ matches the true position-conditioned posterior under $p^\star$. For every $m \in \{1, \ldots, L_t\}$, we take the target model to be the oracle next-position law

$$\pi_\theta^{j_m}(x) = p^\star\!\left(X_0^{j_m} = x \,\middle|\, X_t^{\geq j_m}, X_0^{<j_m}, t\right). \tag{5}$$

###### Theorem 1 (Exact joint law).

Under Eq. [5](https://arxiv.org/html/2605.14305#S3.E5), the sequence produced by FeF-DLLM satisfies, for any $x_{J_t} \in \mathcal{V}^{L_t}$,

$$\Pr\!\left(\hat{X}_{0,J_t} = x_{J_t} \,\middle|\, X_t, t\right) = \prod_{m=1}^{L_t} p^\star\!\left(X_0^{j_m} = x^{j_m} \,\middle|\, X_t^{\geq j_m}, x_0^{<j_m}, t\right).$$

Equivalently,

$$\Pr\!\left(\hat{X}_{0,J_t} = x_{J_t} \,\middle|\, X_t, t\right) = p^\star(x_{J_t} \mid X_t, t).$$

###### Corollary 1 (Conditional correctness under resampling).

Consider any resampling pass $s \geq 1$, and let

$$R^{(s)} = \{r_1^{(s)}, \ldots, r_{L_s}^{(s)}\}, \qquad r_1^{(s)} < \cdots < r_{L_s}^{(s)},$$

be the positions selected for resampling. Conditional on the previous sequence $\hat{X}_0^{(s-1)}$ and the selected set $R^{(s)}$, assume that Eq. [5](https://arxiv.org/html/2605.14305#S3.E5) holds on $R^{(s)}$. Then, for any $x_{R^{(s)}} \in \mathcal{V}^{L_s}$,

$$\Pr\!\left(\hat{X}_{0,R^{(s)}}^{(s)} = x_{R^{(s)}} \,\middle|\, \hat{X}_0^{(s-1)}, R^{(s)}, t\right) = p^\star\!\left(x_{R^{(s)}} \,\middle|\, \hat{X}_0^{(s-1)}, R^{(s)}, t\right).$$

The corollary is a conditional statement: once the resampling set is fixed, the same distributional correctness argument applies to the selected positions\. Therefore, repeated low\-confidence resampling preserves the correctness of each conditional regeneration step\.

#### Expected speculative progress\.

We next quantify the expected number of positions committed in one speculative round. Let $k$ denote the window size and let $\alpha$ denote the average probability that a draft token is accepted by the target model. Following the standard analysis of speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2605.14305#bib.bib6)), we assume that acceptance events within a window are independent.

###### Theorem 2 (Expected committed length).

Let $C_k$ be the number of positions committed in one speculative round, where the first rejected position, if any, is also committed after correction. Then

$$\mathbb{E}[C_k] = \sum_{\ell=0}^{k-1} \alpha^{\ell} = \begin{cases} \dfrac{1-\alpha^{k}}{1-\alpha}, & 0 \leq \alpha < 1, \\[6pt] k, & \alpha = 1. \end{cases}$$
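Theorem 2 is easy to check by simulation under the same independence assumption. The sketch below compares the closed form with a Monte Carlo estimate; the values of `alpha`, the number of rounds, and the seed are arbitrary illustrative choices.

```python
import random

def expected_committed(alpha, k):
    """E[C_k] from Theorem 2: the sum of alpha**l for l = 0, ..., k-1."""
    if alpha == 1:
        return float(k)
    return (1 - alpha ** k) / (1 - alpha)

def simulate(alpha, k, rounds, rng):
    """Monte Carlo estimate of E[C_k] with independent acceptance events."""
    total = 0
    for _ in range(rounds):
        committed = 0
        for _ in range(k):
            committed += 1               # the position is committed either way
            if rng.random() >= alpha:    # rejection: corrected token kept, window ends
                break
        total += committed
    return total / rounds

closed_form = expected_committed(0.8, 16)
estimate = simulate(0.8, 16, rounds=50_000, rng=random.Random(0))
```

The position counter increases before the acceptance test because a rejected position is still committed after correction, which is what makes the sum start at $\ell = 0$.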

###### Corollary 2 (Idealized acceleration ratio).

Let $c_\rho$ and $c_\pi$ denote the wall-clock costs of one draft proposal pass and one target verification pass, respectively. Relative to a prefix-conditioned sequential target baseline that commits one position per target pass, the idealized expected acceleration ratio is

$$S = \frac{\mathbb{E}[C_k]\, c_\pi}{c_\rho + c_\pi}.$$

In particular, if $c_\rho \approx c_\pi$, then

$$S \approx \frac{\mathbb{E}[C_k]}{2} = \begin{cases} \dfrac{1-\alpha^{k}}{2(1-\alpha)}, & 0 \leq \alpha < 1, \\[6pt] \dfrac{k}{2}, & \alpha = 1. \end{cases}$$

Thus, larger windows and higher acceptance rates increase the expected number of committed positions per speculative round, while the actual wall\-clock gain is additionally modulated by the relative costs of drafting and verification\.
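The trade-off can be tabulated directly from Corollary 2. The sketch below evaluates $S$ for a few acceptance rates at window size 16 with equal draft and target costs; the specific `alpha` values and unit costs are illustrative assumptions, not measurements from the paper.

```python
def acceleration_ratio(alpha, k, c_rho, c_pi):
    """Idealized S = E[C_k] * c_pi / (c_rho + c_pi) from Corollary 2."""
    expected = float(k) if alpha == 1 else (1 - alpha ** k) / (1 - alpha)
    return expected * c_pi / (c_rho + c_pi)

# Equal draft and target cost, window size 16 (acceptance rates are illustrative).
table = {alpha: acceleration_ratio(alpha, 16, 1.0, 1.0) for alpha in (0.5, 0.7, 0.9, 1.0)}

# A free draft pass (c_rho = 0) recovers S = E[C_k].
free_draft = acceleration_ratio(0.5, 2, 0.0, 1.0)
```

Note that $S$ is measured against the sequential prefix-conditioned baseline, so it can fall below 1 when the acceptance rate is low and the draft pass is as expensive as the target pass. It is a different quantity from the speedups in the experiments, which are reported relative to LLaDA-Instruct.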

## 4 Experiment

### 4.1 Experimental Setup

We use a supervised finetuning set of 72 thousand prompt–response pairs for training, covering both code and math data. Each training example is organized as a prompt–response pair, where only response tokens are masked and predicted. The finetuned base model is optimized with the proposed position-conditioned objective in Eq. [4](https://arxiv.org/html/2605.14305#S3.E4). We optimize all models with AdamW using a weight decay of 0.1. The learning rate is initialized at $1 \times 10^{-5}$, with 50 warmup steps, followed by a constant phase and a linear decay over the final 10% of training steps to 0.1 times the peak learning rate. We use a per-device batch size of 1 and gradient accumulation over 2 steps, yielding a global batch size of 16. We train for 1 epoch and use bf16 mixed precision.
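The schedule and batch arithmetic above can be written out as a short sketch. Several details are assumptions: the warmup is taken to be linear (the text gives only its length), the dataset is treated as exactly 72,000 pairs, and the device count of 8 is inferred from the stated global batch size of 16.

```python
def learning_rate(step, total_steps, peak=1e-5, warmup=50, decay_frac=0.10, floor_ratio=0.1):
    """Warmup, constant phase, then linear decay to floor_ratio * peak over the last 10%."""
    decay_start = total_steps - int(total_steps * decay_frac)
    if step < warmup:
        return peak * (step + 1) / warmup            # linear warmup (shape assumed)
    if step < decay_start:
        return peak                                  # constant phase
    progress = (step - decay_start) / max(1, total_steps - 1 - decay_start)
    return peak * (1.0 - (1.0 - floor_ratio) * progress)

per_device_batch, accumulation_steps, num_devices = 1, 2, 8
global_batch = per_device_batch * accumulation_steps * num_devices

# One epoch over an assumed 72,000 examples.
total_steps = 72_000 // global_batch

schedule_samples = [learning_rate(s, total_steps) for s in (0, 49, 2000, total_steps - 1)]
```

Under these assumptions one epoch is 4,500 optimizer steps, the decay covers the last 450 of them, and the final step lands at one tenth of the peak rate.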

For speculative decoding, both the draft and target models are instantiated with the fine-tuned model, and the window size is set to 16. This choice is further validated in the ablation study. All training and inference experiments are conducted on 8 NVIDIA A100 80GB GPUs.

### 4.2 Main Results

We evaluate the models on four widely used benchmarks: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.14305#bib.bib25)), MATH (Hendrycks et al., [2021](https://arxiv.org/html/2605.14305#bib.bib26)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.14305#bib.bib28)), and MBPP (Austin et al., [2021b](https://arxiv.org/html/2605.14305#bib.bib29)), covering both mathematical reasoning and code generation. To ensure reproducibility, we rely on the standardized OpenCompass (Contributors, [2023](https://arxiv.org/html/2605.14305#bib.bib44)) evaluation pipeline. We report two main metrics: Accuracy, which measures final task performance, and Speedup, which measures the wall-clock acceleration ratio relative to the baseline decoding method. In all experiments, we use the inference time of LLaDA-Instruct as the reference runtime, denoted as $1\times$.

We adopt LLaDA-Instruct (Nie et al., [2025](https://arxiv.org/html/2605.14305#bib.bib15)) as the backbone model and build our method upon it; consequently, LLaDA-Instruct serves as the primary baseline for comparison. In addition, we compare our approach with several representative methods, including SSD (Gao et al., [2025](https://arxiv.org/html/2605.14305#bib.bib30)), DDOSP (Lavenant and Zanella, [2025](https://arxiv.org/html/2605.14305#bib.bib37)), and DCD (Liu et al., [2024](https://arxiv.org/html/2605.14305#bib.bib33)). Since the original papers, with the exception of LLaDA-Instruct, do not report results on the benchmarks considered in this work, we reproduce all remaining baselines under a unified evaluation protocol. Further implementation details are provided in Appendix [A](https://arxiv.org/html/2605.14305#A1). It is worth noting that the original LLaDA decodes only one token at each step, which partially mitigates factorization error. To more directly evaluate the effectiveness of our method under a comparable multi-token decoding setting, we additionally reproduce a variant of LLaDA that decodes two tokens per step, denoted as LLaDA/2. The results are reported in Table [1](https://arxiv.org/html/2605.14305#S4.T1).

Table 1: Main results on mathematical reasoning and code generation benchmarks. Best accuracy values are in bold, and fastest speed values are underlined. Mean denotes the average results across all four benchmarks. For FeF-DLLM, step denotes the number of diffusion model steps.

| Methods | GSM8K Acc. | GSM8K Speed | MATH Acc. | MATH Speed | HumanEval Acc. | HumanEval Speed | MBPP Acc. | MBPP Speed | Mean Acc. | Mean Speed |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.14305#bib.bib15)) | 78.60 | $1.00\times$ | 26.60 | $1.00\times$ | 47.60 | $1.00\times$ | 34.20† | $1.00\times$ | 46.75 | $1.00\times$ |
| LLaDA/2 (Nie et al., [2025](https://arxiv.org/html/2605.14305#bib.bib15)) | 76.42 | $1.92\times$ | 25.46 | $1.99\times$ | 32.32 | $2.02\times$ | 35.00 | $1.98\times$ | 42.30 | $1.98\times$ |
| SSD (Gao et al., [2025](https://arxiv.org/html/2605.14305#bib.bib30)) | 77.10 | $2.23\times$ | 34.94 | $2.16\times$ | 43.09 | $2.12\times$ | 39.20 | $1.83\times$ | 48.58 | $2.09\times$ |
| DDOSP (Lavenant and Zanella, [2025](https://arxiv.org/html/2605.14305#bib.bib37)) | 74.15 | $1.92\times$ | 25.78 | $2.03\times$ | 28.05 | $2.01\times$ | 29.20 | $1.98\times$ | 39.30 | $1.98\times$ |
| DCD (Liu et al., [2024](https://arxiv.org/html/2605.14305#bib.bib33)) | 78.24 | $0.36\times$ | 26.36 | $0.51\times$ | **50.00** | $0.43\times$ | 37.60 | $0.40\times$ | 48.05 | $0.43\times$ |
| FeF-DLLM (step=2) | 79.38 | $\underline{3.55\times}$ | 36.40 | $\underline{3.25\times}$ | 48.78 | $\underline{3.76\times}$ | **42.60** | $\underline{4.89\times}$ | 51.79 | $\underline{3.86\times}$ |
| FeF-DLLM (step=4) | **79.68** | $2.14\times$ | **36.56** | $1.99\times$ | 49.39 | $2.21\times$ | **42.60** | $2.99\times$ | **52.06** | $2.33\times$ |

† The reproduced LLaDA accuracy on MBPP is 37.40; we report 34.20 from the original LLaDA result for consistency.

We draw two key observations from Table [1](https://arxiv.org/html/2605.14305#S4.T1). First, increasing the number of tokens decoded at each step can substantially improve inference throughput, but it may also lead to a pronounced degradation in generation quality. For example, LLaDA/2 achieves approximately a $2\times$ speedup over LLaDA; however, its performance decreases on GSM8K, MATH, and HumanEval. This suggests that simply increasing decoding parallelism, in the absence of an explicit correction mechanism, can exacerbate factorization errors and compromise prediction accuracy. This observation validates our motivation that efficient parallel decoding for diffusion language models requires an effective correction mechanism to mitigate factorization errors.

Second, FeF-DLLM provides a stronger accuracy–efficiency trade-off than the baseline, direct speculative decoding methods, and prior approaches for mitigating factorization errors. Averaged over the four benchmarks, FeF-DLLM with step=2 improves the mean accuracy from 46.75 to 51.79, achieving a 5.04 point gain over LLaDA, while boosting the mean decoding speed to $3.86\times$. When using step=4, FeF-DLLM further increases the mean accuracy to 52.06, yielding a 5.31 point improvement over LLaDA, while still providing a $2.33\times$ speedup. Notably, FeF-DLLM also surpasses SSD, a direct speculative decoding baseline, by 3.21 and 3.48 accuracy points under step=2 and step=4, respectively. In addition, compared with factorization-error mitigation methods such as DDOSP and DCD, FeF-DLLM achieves substantially faster inference, showing that resampling-based correction can effectively improve accuracy while preserving high decoding efficiency.
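The averaged figures quoted above can be re-derived from the per-benchmark accuracies in Table 1. The short sketch below does this arithmetic; the dictionary layout and the helper names `mean_accuracy` and `gain` are illustrative choices, not part of the released code, and the sketch assumes gains are differences of the rounded means.

```python
# Accuracy values copied from Table 1, in the order GSM8K, MATH, HumanEval, MBPP.
ACCURACY = {
    "LLaDA": [78.60, 26.60, 47.60, 34.20],
    "SSD": [77.10, 34.94, 43.09, 39.20],
    "FeF-DLLM (step=2)": [79.38, 36.40, 48.78, 42.60],
    "FeF-DLLM (step=4)": [79.68, 36.56, 49.39, 42.60],
}


def mean_accuracy(method: str) -> float:
    """Unweighted mean over the four benchmarks, rounded to two decimals."""
    scores = ACCURACY[method]
    return round(sum(scores) / len(scores), 2)


def gain(method: str, baseline: str) -> float:
    """Difference between two rounded means, in accuracy points."""
    return round(mean_accuracy(method) - mean_accuracy(baseline), 2)


if __name__ == "__main__":
    for name in ("FeF-DLLM (step=2)", "FeF-DLLM (step=4)"):
        print(name, mean_accuracy(name), gain(name, "LLaDA"), gain(name, "SSD"))
```

Under these assumptions the helpers reproduce the mean accuracies and the point gains over LLaDA and SSD stated in the paragraph above.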

### 4.3 Ablation Study

We conduct four ablation studies to evaluate the contributions of different components in FeF-DLLM, focusing on the effects of training, speculative decoding, draft-model selection, and the speculative decoding window size.

#### Ablation 1: Training Effects

LLaDA-Instruct can also be used directly for inference with the proposed method, without further finetuning of its parameters. To rule out the possibility that the observed improvements are primarily attributable to finetuning rather than to the proposed inference strategy, we also evaluate the trained model using the original inference procedure of LLaDA-Instruct. The results are reported in Table [2](https://arxiv.org/html/2605.14305#S4.T2).

Table 2: Ablation study on training effects. Best accuracy values within each comparison group are in bold. Mean denotes the average results across all four benchmarks.

The trained LLaDA model exhibits slight performance drops on GSM8K, MATH, and HumanEval, and only improves on MBPP. In contrast, FeF-DLLM with training consistently outperforms its untrained counterpart across all four benchmarks. These results demonstrate that the gains achieved by our method are driven by the joint effect of finetuning and the proposed inference strategy.

#### Ablation 2: Effect of Speculative Decoding

To demonstrate the acceleration effect of speculative decoding, we compare FeF-DLLM with a counterpart that uses the same model but disables speculative decoding. Table [3](https://arxiv.org/html/2605.14305#S4.T3) reports the results. Here, FeF-DLLM w/o SD denotes the non-speculative setting. The results show that incorporating speculative decoding substantially improves inference speed while preserving accuracy.

Table 3: Ablation study on the effect of speculative decoding (SD). All models in this table are trained models. Mean denotes the average results across all four benchmarks.

#### Ablation 3: Choice of Draft Model

In all previous experiments, we use the same model as both the draft model and the verify model by default. Here, we further analyze how the choice of draft model affects final performance. The results are reported in Table [4](https://arxiv.org/html/2605.14305#S4.T4). In addition to task accuracy and wall-clock speedup, we also report the average Acceptance, which measures the fraction of draft tokens accepted by the verify model.

Table 4: Ablation study on the choice of draft models. “A / B” denotes using A as the draft model and B as the verify model. All results are reported with resampling step $=4$.

The results show that the final accuracy remains unchanged across the two draft-model choices on all four benchmarks. This suggests that, in this setting, the verify model primarily determines the final prediction quality. Meanwhile, using the same model for both drafting and verification leads to consistently higher acceptance rates and slightly better speedup. This observation is consistent with our analysis in Corollary [2](https://arxiv.org/html/2605.14305#S3.Ex20): when the draft model and the verify model are better aligned, the verify model accepts more draft tokens, which leads to more efficient speculative decoding.
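The link between alignment and acceptance can be made concrete for a single position. Under the standard speculative accept rule, where a draft token $x \sim \rho$ is accepted with probability $\min\{1, \pi(x)/\rho(x)\}$, the overall acceptance probability is $\sum_x \min\{\pi(x), \rho(x)\}$. The sketch below evaluates this quantity with NumPy; the function name and the toy distributions are illustrative assumptions, not values from the paper.

```python
import numpy as np


def acceptance_probability(target: np.ndarray, draft: np.ndarray) -> float:
    """Probability that one draft token is accepted.

    Summing draft(x) * min(1, target(x) / draft(x)) over the vocabulary
    simplifies to the sum of element-wise minima.
    """
    return float(np.minimum(target, draft).sum())


if __name__ == "__main__":
    target = np.array([0.7, 0.2, 0.1])       # toy target law
    misaligned = np.array([0.2, 0.3, 0.5])   # toy draft law far from the target
    print(acceptance_probability(target, target))      # identical laws
    print(acceptance_probability(target, misaligned))  # poorly aligned laws
```

When the two laws coincide, every draft token is accepted; the further the draft drifts from the target, the smaller the overlap and the lower the acceptance rate.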

#### Ablation 4: Speculative Decoding Window Size

We further study the effect of the speculative decoding window size. Table [5](https://arxiv.org/html/2605.14305#S4.T5) reports the results with resampling step $=4$. The results show that increasing the window size substantially improves decoding speed while preserving the same accuracy. Larger window sizes could not be tested due to device limitations. Therefore, a window size of 16 was selected for the main experiments.

Table 5: Ablation study on the speculative decoding window size. All results are reported with resampling step $=4$. Mean denotes the average results across all four benchmarks.

## 5 Limitation and Conclusion

In this paper, we proposed FeF\-DLLM, a factorization\-error\-free discrete diffusion language modeling method based on prefix\-conditioned clean\-token prediction\. By replacing independent token\-wise prediction with exact prefix\-conditioned factorization, FeF\-DLLM preserves token dependencies and eliminates factorization error\. We further incorporated speculative decoding to accelerate inference while maintaining the correctness of the target distribution\. Experiments on mathematical reasoning and code generation benchmarks show that FeF\-DLLM improves generation quality and achieves substantial inference speedup\.

One limitation of our method is that prefix\-conditioned prediction and speculative verification require more computational resources during inference than standard DLLM decoding\. Future work will explore more resource\-efficient implementations to further reduce this overhead\.

## References

- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021a) Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 17981–17993.
- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021b) Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, et al. (2024) Lumiere: a space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–11.
- T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan, C. Li, C. Li, J. Li, Z. Li, H. Liu, L. Liu, G. Lu, X. Lu, Y. Ma, J. Tan, L. Wei, J. Wen, Y. Xing, X. Zhang, J. Zhao, D. Zheng, J. Zhou, J. Zhou, Z. Zhou, L. Zhu, and Y. Zhuang (2025) LLaDA2.0: scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745.
- A. Campbell, V. De Bortoli, J. Shi, and A. Doucet (2025) Self-speculative masked diffusions. arXiv preprint arXiv:2510.03929.
- C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023) Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Z. Cheng, G. Yang, J. Li, Z. Deng, M. Guo, and S. Hu (2025) DEER: draft with diffusion, verify with autoregressive models. arXiv preprint arXiv:2512.15176.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, Ł. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- O. Contributors (2023) OpenCompass: a universal evaluation platform for foundation models. Note: [https://github.com/open-compass/opencompass](https://github.com/open-compass/opencompass)
- Y. Gao, Z. Ji, Y. Wang, B. Qi, H. Xu, and L. Zhang (2025) Self speculative decoding for diffusion large language models. arXiv preprint arXiv:2510.04147.
- I. Gat, T. Remez, N. Shaul, F. Kreuk, R. T. Chen, G. Synnaeve, Y. Adi, and Y. Lipman (2024) Discrete flow matching. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 133345–133385.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 6840–6851.
- J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet (2022) Video diffusion models. Advances in Neural Information Processing Systems (NeurIPS) 35, pp. 8633–8646.
- H. Lavenant and G. Zanella (2025) Error bounds and optimal schedules for masked diffusions with factorized approximations. arXiv preprint arXiv:2510.25544.
- Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning (ICML), pp. 19274–19286.
- Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.
- A. Liu, O. Broadrick, M. Niepert, and G. Van den Broeck (2024) Discrete copula diffusion. In The Thirteenth International Conference on Learning Representations.
- A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion language modeling by estimating the ratios of the data distribution. In International Conference on Learning Representations (ICLR).
- N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie (2024) SiT: exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pp. 23–40.
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, et al. (2025) LLaDA: large language diffusion models. arXiv preprint arXiv:2502.09992.
- W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.
- S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37.
- J. Song, C. Meng, and S. Ermon (2020a) Denoising diffusion implicit models. In International Conference on Learning Representations.
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020b) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
- C. Wang, H. Peng, Y. Liu, J. Gu, and S. Hu (2025) Diffusion models for 3d generation: a survey. Computational Visual Media 11(1), pp. 1–28.
- R. Wu, T. Yang, L. Sun, Z. Zhang, S. Li, and L. Zhang (2024) SeeSR: towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 25456–25467.
- J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487.
- J. Yim, H. Stärk, G. Corso, B. Jing, R. Barzilay, and T. S. Jaakkola (2024) Diffusion models in protein structure and docking. Wiley Interdisciplinary Reviews: Computational Molecular Science 14(2), pp. e1711.
- J. Yoo, W. Kim, and S. Hong (2025) ReDi: rectified discrete flow. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

## Appendix A Implementation Details of Compared Methods

We compare against three representative inference-time baselines, namely SSD \[Gao et al., [2025](https://arxiv.org/html/2605.14305#bib.bib30)\], DCD \[Liu et al., [2024](https://arxiv.org/html/2605.14305#bib.bib33)\], and DDOSP \[Lavenant and Zanella, [2025](https://arxiv.org/html/2605.14305#bib.bib37)\]. For all compared methods, we use the same LLaDA-Instruct checkpoint, prompt formatting, maximum generation length, and OpenCompass evaluation pipeline as in the main experiments. Unless otherwise stated, we keep the denoising step budget, block partition, temperature, classifier-free guidance scale, and re-masking rule identical to the LLaDA baseline, so that the differences come only from the inference strategy itself.

#### SSD\.

We implement SSD as a training\-free decoding wrapper around the same LLaDA checkpoint\. At each denoising step, the model first predicts all masked positions in parallel and uses the top\-1 token at each masked position as the self\-draft token\. Candidate positions are selected within the current semi\-autoregressive decoding block according to model confidence, and a greedy linear verification tree is constructed\. All verification nodes are then packed into one batch and verified by the same LLaDA model\. Therefore, SSD does not introduce an auxiliary draft model or any additional training\. In the experiments, the same checkpoint is used as both the drafter and the verifier\.

#### DCD\.

The original DCD formulation requires a diffusion model and an autoregressive copula model that share exactly the same tokenizer and token-to-id mapping. Because standard causal language models such as GPT-2 are not tokenizer-compatible with LLaDA, we implement a causalized DCD-like baseline by reusing the same LLaDA checkpoint with a causal attention bias as the copula model. At each denoising step, we compute diffusion logits $l_{\mathrm{diff}}$ and causalized copula logits $l_{\mathrm{cop}}$, and form the proposal logits as

$$l_{\mathrm{prop}} = l_{\mathrm{cop}} + \alpha\, l_{\mathrm{diff}},$$

where $\alpha$ controls the strength of the diffusion proposal and is set to 1.0 by default. For efficiency, we use the one-pass causal proposal mode in the main experiments, namely one causal forward pass per denoising step, while the sequential mode is reserved for additional analysis because it is substantially slower. Since LLaDA does not expose the same score/transition pair as the original Copula-Diffusion implementation, this baseline should be regarded as a DCD-style approximation rather than an exact reimplementation of the original method.
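As a rough illustration of the logit combination above, the following NumPy sketch forms $l_{\mathrm{prop}}$ and normalizes it. It assumes that the proposal logits are turned into a distribution with a softmax, which the text does not state explicitly; the function name `proposal_distribution` and the toy inputs are hypothetical.

```python
import numpy as np


def proposal_distribution(l_cop: np.ndarray, l_diff: np.ndarray,
                          alpha: float = 1.0) -> np.ndarray:
    """Softmax of l_prop = l_cop + alpha * l_diff over the last axis.

    Assumption: a softmax is used for normalization (not confirmed by the text).
    """
    l_prop = l_cop + alpha * l_diff
    l_prop = l_prop - l_prop.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(l_prop)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    # Toy three-token vocabulary; logits are log-probabilities here.
    l_cop = np.log(np.array([0.5, 0.25, 0.25]))
    l_diff = np.log(np.array([0.2, 0.6, 0.2]))
    print(proposal_distribution(l_cop, l_diff))             # alpha = 1.0
    print(proposal_distribution(l_cop, l_diff, alpha=0.0))  # copula only
```

With $\alpha = 1$ and a softmax, adding logits is equivalent to multiplying the two distributions and renormalizing, so the proposal favors tokens that both models consider likely.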

#### DDOSP\.

DDOSP does not train a new model; instead, it replaces the handcrafted uniform unmasking schedule with a data-driven non-uniform schedule estimated from the denoiser. Specifically, we use training-set response sequences to estimate the information profile $f(i)$ under different numbers of revealed tokens, where the expectation is approximated by Monte Carlo masking and batched denoiser forward passes. The estimated profile is then smoothed and further corrected to be monotone before computing the incremental information gains $\Delta f(i)$. Based on these gains, we construct a $K$-step decoding schedule, cache the resulting schedule file, and reuse it during inference. During sampling, DDOSP only changes how many masked positions are released at each denoising round; the denoiser, token proposal rule, blockwise generation order, temperature, and re-masking strategy remain the same as in the LLaDA baseline. When no precomputed schedule is available, the implementation falls back to the uniform schedule.

## Appendix B Implementation Details of Training

This section supplements the training details not explicitly reported in Section[4\.1](https://arxiv.org/html/2605.14305#S4.SS1)\. In addition to the hardware and optimization settings described there, we report the released supplementary training runs and their usage in evaluation\.

For the supplementary training setup, all runs were executed on 8 NVIDIA A100 80GB GPUs\. The math training run used 18k examples for 1 epoch and took approximately 17\.6 hours, corresponding to about 140\.8 GPU\-hours; the resulting model was used for evaluation on GSM8K and MATH\. The code training run used 54k examples for 1 epoch and took approximately 52\.8 hours, corresponding to about 422\.4 GPU\-hours; the resulting model was used for evaluation on HumanEval and MBPP\. Summing these two runs gives an approximate total of 563\.2 GPU\-hours for the released supplementary training runs\.

Additional implementation details are provided in Appendix [A](https://arxiv.org/html/2605.14305#A1) and Appendix [C](https://arxiv.org/html/2605.14305#A3).

## Appendix C Algorithm

This section provides the detailed algorithms for FeF-DLLM. Algorithm [1](https://arxiv.org/html/2605.14305#alg1) summarizes the position-conditioned training procedure. Algorithm [2](https://arxiv.org/html/2605.14305#alg2) describes the full-sequence inference procedure, which combines speculative verification with low-confidence re-corruption.

Before presenting the inference algorithm, we define the residual distribution used when a draft token is rejected. For a target law $\pi_{\theta}^{j_m}$ and a draft law $\rho_{\phi}^{j_m}$ at position $j_m$, define

$$\pi_{\theta}^{\prime j_m}(x) = \frac{\left[\pi_{\theta}^{j_m}(x) - \rho_{\phi}^{j_m}(x)\right]_{+}}{1 - \sum_{z \in \mathcal{V}} \min\{\pi_{\theta}^{j_m}(z), \rho_{\phi}^{j_m}(z)\}}, \tag{6}$$

where $[u]_{+} = \max\{u, 0\}$. If the denominator is zero, the rejection event has probability zero, and the residual distribution is never sampled.
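A minimal NumPy sketch of the residual distribution in Eq. 6 is given below. The function name, the tolerance used to detect a zero denominator, and the toy distributions are illustrative assumptions rather than details from the released implementation.

```python
from typing import Optional

import numpy as np


def residual_distribution(target: np.ndarray, draft: np.ndarray,
                          tol: float = 1e-12) -> Optional[np.ndarray]:
    """Residual law of Eq. 6: positive part of (target - draft), renormalized.

    Returns None when the denominator is (numerically) zero, in which case a
    rejection has probability zero and the residual is never sampled.
    The tolerance `tol` is an assumption of this sketch.
    """
    positive_part = np.maximum(target - draft, 0.0)
    denominator = 1.0 - np.minimum(target, draft).sum()
    if denominator <= tol:
        return None
    return positive_part / denominator


if __name__ == "__main__":
    target = np.array([0.7, 0.2, 0.1])
    draft = np.array([0.2, 0.3, 0.5])
    print(residual_distribution(target, draft))   # mass only where target > draft
    print(residual_distribution(target, target))  # identical laws: None
```

The denominator equals the total positive part $\sum_x [\pi(x) - \rho(x)]_+$, so the result is a valid distribution. Combining the accepted mass $\min\{\pi(x), \rho(x)\}$ with the rejection probability times the residual recovers $\pi(x)$ exactly, which is the one-step correction property that the accept-reject rule relies on.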

**Algorithm 1** Position-Conditioned Training for FeF-DLLM

1. **Input:** predictor $p_{\theta}$, data distribution $q(X_0)$, response position set $J_{\mathrm{resp}}$
2. **Output:** trained predictor $p_{\theta}$
3. **repeat**
4. Sample clean data $X_0 \sim q(X_0)$.
5. Sample diffusion time $t$.
6. Sample corrupted data $X_t \sim q(X_t \mid X_0)$.
7. Define the ordered prediction positions:
8. $J_t = \{i \in J_{\mathrm{resp}} : X_t^{i} \neq X_0^{i}\} = \{j_1, \dots, j_{L_t}\}, \qquad j_1 < \cdots < j_{L_t}, \qquad L_t = \lvert J_t \rvert.$
9. Initialize loss $\mathcal{L} \leftarrow 0$.
10. **for** $m = 1$ **to** $L_t$ **do**
11. Construct the position-conditioned input $X^{(j_m)}$:
12. $X^{(j_m), i} = \begin{cases} X_0^{i}, & i < j_m, \\ X_t^{i}, & i \geq j_m. \end{cases}$
13. Compute the token-level cross-entropy loss:
14. $\mathrm{CE}_{j_m} = -\log p_{\theta}\left(X_0^{j_m} \mid X^{(j_m)}, t\right).$
15. Accumulate loss:
16. $\mathcal{L} \leftarrow \mathcal{L} + \mathrm{CE}_{j_m}.$
17. **end for**
18. Compute $\nabla_{\theta} \mathcal{L}$ and update $\theta$ with the optimizer.
19. **until** converged
20. **return** $p_{\theta}$

Algorithm [1](https://arxiv.org/html/2605.14305#alg1) trains the model to predict each clean token from a position-conditioned input. For position $j_m$, tokens before $j_m$ are replaced by their clean values from $X_0$, while position $j_m$ and the suffix remain in the corrupted state $X_t$. This construction matches the verification-time input used by the target model during left-to-right speculative verification, thereby aligning training with factorization-error-free inference.
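The input construction in Algorithm 1 can be sketched with plain token-id arrays. The helper name `position_conditioned_inputs`, the use of `-1` as a mask id, and the toy sequence are assumptions made for illustration; the model call and the loss are omitted.

```python
from typing import Iterable, List, Tuple

import numpy as np

MASK_ID = -1  # hypothetical mask token id used only in this sketch


def position_conditioned_inputs(
    x0: np.ndarray, xt: np.ndarray, resp_positions: Iterable[int]
) -> List[Tuple[int, np.ndarray]]:
    """Build (j_m, X^{(j_m)}) for every corrupted response position.

    For each ordered position j_m with xt[j_m] != x0[j_m], positions i < j_m
    take the clean token x0[i]; positions i >= j_m keep the corrupted xt[i].
    """
    corrupted = sorted(i for i in resp_positions if xt[i] != x0[i])
    inputs = []
    for j in corrupted:
        x_j = xt.copy()
        x_j[:j] = x0[:j]  # clean prefix strictly before position j
        inputs.append((j, x_j))
    return inputs


if __name__ == "__main__":
    x0 = np.array([5, 6, 7, 8, 9])                    # clean sequence
    xt = np.array([5, MASK_ID, 7, MASK_ID, MASK_ID])  # corrupted sequence
    for j, x_j in position_conditioned_inputs(x0, xt, range(1, 5)):
        print(j, x_j.tolist())
```

Each returned input would be paired with the target token `x0[j]` in the cross-entropy term of Algorithm 1. The sketch builds the inputs one by one for clarity; a practical implementation would likely batch them.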

**Algorithm 2** Full-Sequence FeF-DLLM Decoding

1. **Input:** initial corrupted sequence $X_t$, prediction positions $J_{\mathrm{pred}}$, draft model $M_{\rho}$, target model $M_{\pi}$, window size $k$, number of steps $N_{\mathrm{step}}$, resampling budget $n_{\mathrm{rm}}$
2. **Output:** final generated sequence $\hat{X}_0$
3. Initialize $X \leftarrow X_t$.
4. **for** $s = 0$ **to** $N_{\mathrm{step}} - 1$ **do**
5. Initialize $\mathcal{C}^{(s)} \leftarrow \emptyset$.
6. **while** there exists an unresolved position in $J_{\mathrm{pred}}$ **do**
7. Let $J_t = \{j_1, \dots, j_{L_t}\} \subseteq J_{\mathrm{pred}}$, with $j_1 < \cdots < j_{L_t}$, be the unresolved positions.
8. Set $b \leftarrow \min\{k, L_t\}$.
9. Compute draft laws $\rho_{\phi}^{j_m}(\cdot) = \rho_{\phi}(X_0^{j_m} = \cdot \mid X, t)$ for $m = 1, \dots, b$.
10. Sample draft tokens $\tilde{X}_0^{j_m} \sim \rho_{\phi}^{j_m}(\cdot)$ for $m = 1, \dots, b$.
11. Construct verification inputs $X^{(j_m)}$ for $m = 1, \dots, b$,
12. each input uses the verified prefix and the draft prefix $\tilde{X}_0^{j_1}, \dots, \tilde{X}_0^{j_{m-1}}$.
13. Compute target laws $\pi_{\theta}^{j_m}(\cdot) = p_{\theta}(X_0^{j_m} = \cdot \mid X^{(j_m)}, t)$ for $m = 1, \dots, b$.
14. **for** $m = 1$ **to** $b$ **do**
15. Sample $u_m \sim \mathrm{Uniform}(0, 1)$.
16. **if** $u_m \leq \min\{1, \pi_{\theta}^{j_m}(\tilde{X}_0^{j_m}) / \rho_{\phi}^{j_m}(\tilde{X}_0^{j_m})\}$ **then**
17. $X^{j_m} \leftarrow \tilde{X}_0^{j_m}$, $c_{j_m} \leftarrow \pi_{\theta}^{j_m}(\tilde{X}_0^{j_m})$.
18. $\mathcal{C}^{(s)} \leftarrow \mathcal{C}^{(s)} \cup \{(j_m, c_{j_m})\}$.
19. **else**
20. Sample $Z^{j_m} \sim \pi_{\theta}^{\prime j_m}(\cdot)$ according to Eq. [6](https://arxiv.org/html/2605.14305#A3.E6).
21. $X^{j_m} \leftarrow Z^{j_m}$, $c_{j_m} \leftarrow \pi_{\theta}^{j_m}(Z^{j_m})$.
22. Discard the remaining draft tokens and restart from the updated $X$.
23. $\mathcal{C}^{(s)} \leftarrow \mathcal{C}^{(s)} \cup \{(j_m, c_{j_m})\}$.
24. **break**
25. **end if**
26. **end for**
27. **end while**
28. $\hat{X}_0^{(s)} \leftarrow X$.
29. **if** $s < N_{\mathrm{step}} - 1$ **then**
30. Select $R^{(s+1)} = \operatorname{LowConf}_{n_{\mathrm{rm}}}(\mathcal{C}^{(s)})$.
31. **for each** $i \in R^{(s+1)}$ **do**
32. Re-corrupt $X^{i} \sim q(X_t^{i} \mid \hat{X}_0^{(s), i})$.
33. **end for**
34. **end if**
35. **end for**
36. **return** $\hat{X}_0 = \hat{X}_0^{(N_{\mathrm{step}} - 1)}$

Algorithm [2](https://arxiv.org/html/2605.14305#alg2) describes the full-sequence inference procedure of FeF-DLLM. At each generation step, the method repeatedly applies speculative decoding over the unresolved prediction positions. The draft model $M_{\rho}$ proposes up to $k$ candidate clean tokens in the current speculative window, and the target model $M_{\pi}$ verifies these candidates in a fixed left-to-right order using position-conditioned inputs. Accepted or corrected tokens are written back to the sequence and serve as clean prefix context for subsequent positions. If all candidates in the window are accepted, the whole window is committed; if one candidate is rejected, a corrected token is sampled from the residual distribution, the remaining draft tokens in the window are discarded, and a new speculative window starts from the updated sequence. After all prediction positions are resolved, the method selects low-confidence positions and re-corrupts them through the D3PM forward kernel for the next generation step. This procedure preserves the accept–reject correction of speculative decoding while allowing low-confidence positions to be refined through repeated generation and re-corruption steps.
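The inner accept-reject loop of Algorithm 2 can be sketched for a single speculative window as follows. This is a simplified illustration, not the authors' implementation: the draft and target laws are passed in as precomputed probability vectors (in the algorithm the target laws come from position-conditioned forward passes), and the function name `verify_window` is hypothetical. Confidence tracking and re-corruption are omitted.

```python
from typing import List, Sequence

import numpy as np


def verify_window(draft_tokens: Sequence[int],
                  draft_laws: Sequence[np.ndarray],
                  target_laws: Sequence[np.ndarray],
                  rng: np.random.Generator) -> List[int]:
    """Left-to-right verification of one window.

    Returns the committed tokens: every accepted draft token, plus one
    corrected token sampled from the residual law at the first rejection,
    after which the remaining draft tokens are discarded.
    """
    committed: List[int] = []
    for token, rho, pi in zip(draft_tokens, draft_laws, target_laws):
        rho = np.asarray(rho, dtype=float)
        pi = np.asarray(pi, dtype=float)
        # rho[token] > 0 because the token was sampled from rho.
        accept_prob = min(1.0, pi[token] / rho[token])
        if rng.uniform() <= accept_prob:
            committed.append(int(token))
            continue
        # Rejection: sample from the normalized positive part of (pi - rho).
        residual = np.maximum(pi - rho, 0.0)
        residual = residual / residual.sum()
        committed.append(int(rng.choice(len(pi), p=residual)))
        break
    return committed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    law = np.array([0.2, 0.3, 0.5])
    print(verify_window([2, 0, 1], [law] * 3, [law] * 3, rng))
```

Normalizing the positive part by its own sum gives the same distribution as Eq. 6, since that sum equals one minus the overlap of the two laws. When the draft and target laws coincide, every draft token is accepted and the whole window is committed.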

## Appendix D Theoretical Proofs

### D.1 Proof of Lemma [1](https://arxiv.org/html/2605.14305#Thmlemma1)

###### Proof\.

For simplicity, we omit the timestep index in the probability notation\. By the chain rule and the position\-wise corruption assumption, we have

$$
\begin{aligned}
p(X_0^i \mid X_t, X_0^{<i})
&= \frac{p(X_0^i \mid X_t)\, p(X_0^{<i} \mid X_t, X_0^i)}{p(X_0^{<i} \mid X_t)} \\
&= \frac{p(X_0^i \mid X_t)\, p(X_t \mid X_0^{<i}, X_0^i)\, p(X_0^{<i} \mid X_0^i)}{p(X_0^{<i} \mid X_t)\, p(X_t \mid X_0^i)} \\
&= \frac{p(X_0^i \mid X_t)\, p(X_t^{<i} \mid X_0^{<i})\, p(X_t^i \mid X_0^i)\, p(X_t^{>i} \mid X_0^i, X_0^{<i})\, p(X_0^{<i} \mid X_0^i)}{p(X_0^{<i} \mid X_t)\, p(X_t^i \mid X_0^i)\, p(X_t^{-i} \mid X_0^i)} \\
&= \frac{p(X_0^i \mid X_t)\, p(X_t^{<i} \mid X_0^{<i})\, p(X_t^{>i} \mid X_0^i, X_0^{<i})\, p(X_0^{<i} \mid X_0^i)}{p(X_0^{<i} \mid X_t)\, p(X_t^{-i} \mid X_0^i)} \\
&\propto \frac{p(X_0^i \mid X_t)\, p(X_t^{>i} \mid X_0^i, X_0^{<i})\, p(X_0^{<i} \mid X_0^i)}{p(X_t^{-i} \mid X_0^i)} \\
&\propto p(X_0^i \mid X_t)\, \frac{p(X_0^i \mid X_t^{>i}, X_0^{<i})}{p(X_0^i \mid X_t^{-i})}.
\end{aligned}
$$

Moreover, since the forward corruption process factorizes across positions, $X_t^{<i}$ is generated only from $X_0^{<i}$. Hence, once $X_0^{<i}$ is conditioned on, $X_t^{<i}$ provides no additional information about $X_0^i$, which gives

$$
p(X_0^i \mid X_t, X_0^{<i}) = p(X_0^i \mid X_t^{\geq i}, X_0^{<i}).
$$

The first equality follows from the chain rule of conditional probabilities. The second equality expands $p(X_0^{<i} \mid X_t, X_0^i)$ by Bayes' rule. The third equality decomposes the corrupted sequence into prefix, current position, and suffix under the position-wise forward corruption process, with the suffix term understood as the marginal likelihood after integrating out $X_0^{>i}$. The fourth equality cancels the common factor $p(X_t^i \mid X_0^i)$. The first proportionality absorbs all terms independent of $X_0^i$ into the normalization constant. The last proportionality rewrites the remaining likelihood ratio in terms of $p(X_0^i \mid X_t^{>i}, X_0^{<i})$ and $p(X_0^i \mid X_t^{-i})$. The proportionality is over $X_0^i$, with normalization over the vocabulary. ∎
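The conditional-independence step used in this proof, namely that $X_t^{<i}$ adds no information about $X_0^i$ once $X_0^{<i}$ is known, can be checked by brute force on a toy model. The following is a minimal sketch, assuming a two-token binary sequence, an absorbing-mask corruption with masking probability `t`, and an arbitrary clean joint `p0`. All of these values and names are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction
from itertools import product

MASK = "M"
VOCAB = (0, 1)
t = Fraction(1, 3)  # hypothetical per-position masking probability

# Hypothetical clean joint over two tokens, with dependence between positions.
p0 = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 10),
      (1, 0): Fraction(1, 10), (1, 1): Fraction(3, 10)}


def corrupt_prob(x, y):
    """Probability that clean token x is observed as y under independent masking."""
    if y == MASK:
        return t
    return (1 - t) if y == x else Fraction(0)


# Exact joint over (x1, x2, y1, y2): clean tokens and their corrupted versions.
joint = {}
for (x1, x2), px in p0.items():
    for y1, y2 in product(VOCAB + (MASK,), repeat=2):
        w = px * corrupt_prob(x1, y1) * corrupt_prob(x2, y2)
        if w > 0:
            joint[(x1, x2, y1, y2)] = w


def cond_x2(x1, y2, y1=None):
    """Law of the second clean token given x1 and y2, optionally also given y1."""
    weights = {v: Fraction(0) for v in VOCAB}
    for (a, b, c, d), w in joint.items():
        if a == x1 and d == y2 and (y1 is None or c == y1):
            weights[b] += w
    total = sum(weights.values())
    return None if total == 0 else {v: w / total for v, w in weights.items()}


def max_gap():
    """Largest difference between conditioning with and without the corrupted prefix."""
    gap = Fraction(0)
    for x1, y1, y2 in product(VOCAB, VOCAB + (MASK,), VOCAB + (MASK,)):
        full = cond_x2(x1, y2, y1)
        if full is None:  # impossible configuration, e.g. x1 = 0 observed as 1
            continue
        reduced = cond_x2(x1, y2)
        gap = max(gap, max(abs(full[v] - reduced[v]) for v in VOCAB))
    return gap


print(max_gap())
```

Because the arithmetic uses exact fractions, a gap of zero means the two conditionals agree exactly on every reachable configuration of this toy model, not merely up to rounding.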

### D.2 Proof of Theorem [1](https://arxiv.org/html/2605.14305#S3.Ex15)

###### Proof.

Fix the corrupted state $X_t$ and the ordered update set

$$
J_t = \{j_1, \dots, j_{L_t}\}, \qquad j_1 < \cdots < j_{L_t}.
$$

For each $j_m \in J_t$, the target model is instantiated as the oracle next-position law

$$
\pi_\theta^{j_m}(x) = p^{\star}\left(X_0^{j_m} = x \mid X_t^{\geq j_m}, X_0^{<j_m}, t\right), \qquad m = 1, \dots, L_t.
$$

Let $\rho_\phi^{j_m}$ denote the corresponding draft law.

We first verify the one-step correction. Condition on the already committed prefix before position $j_m$. Then both $\pi_\theta^{j_m}$ and $\rho_\phi^{j_m}$ are fixed categorical distributions on $\mathcal{V}$. A proposal $\tilde{X}_0^{j_m} \sim \rho_\phi^{j_m}$ is accepted with probability

$$
a(\tilde{X}_0^{j_m}) = \min\left\{1, \frac{\pi_\theta^{j_m}(\tilde{X}_0^{j_m})}{\rho_\phi^{j_m}(\tilde{X}_0^{j_m})}\right\}.
$$

We use the convention

$$
\rho_\phi^{j_m}(x) \min\left\{1, \frac{\pi_\theta^{j_m}(x)}{\rho_\phi^{j_m}(x)}\right\} = \min\{\pi_\theta^{j_m}(x), \rho_\phi^{j_m}(x)\},
$$

which also covers the case $\rho_\phi^{j_m}(x) = 0$. If the proposal is rejected, the corrected token is sampled from

$$
\pi_\theta^{\prime j_m}(x) = \frac{\left[\pi_\theta^{j_m}(x) - \rho_\phi^{j_m}(x)\right]_{+}}{1 - \sum_{z \in \mathcal{V}} \min\{\pi_\theta^{j_m}(z), \rho_\phi^{j_m}(z)\}},
$$

where $[u]_{+} = \max\{u, 0\}$. If the denominator is zero, rejection occurs with probability zero, and the residual law is never used.

For any $x \in \mathcal{V}$, the probability that $x$ is committed at position $j_m$ is

$$
\begin{aligned}
\Pr(\hat{X}_0^{j_m} = x)
&= \rho_\phi^{j_m}(x) \min\left\{1, \frac{\pi_\theta^{j_m}(x)}{\rho_\phi^{j_m}(x)}\right\} + \left(1 - \sum_{z \in \mathcal{V}} \min\{\pi_\theta^{j_m}(z), \rho_\phi^{j_m}(z)\}\right) \pi_\theta^{\prime j_m}(x) \\
&= \min\{\pi_\theta^{j_m}(x), \rho_\phi^{j_m}(x)\} + \pi_\theta^{j_m}(x) - \min\{\pi_\theta^{j_m}(x), \rho_\phi^{j_m}(x)\} \\
&= \pi_\theta^{j_m}(x).
\end{aligned}
$$

Thus, given the committed prefix, the token committed by the speculative accept–reject step has law $\pi_\theta^{j_m}$.
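The one-step identity above can be reproduced numerically. The sketch below is an illustrative check, not the authors' implementation: it computes the exact law of the committed token from a target law and a draft law, using the accepted mass $\min\{\pi, \rho\}$ plus the rejection probability times the residual law. The function name `committed_law` and the example distributions `pi` and `rho` are assumptions made for this example.

```python
from fractions import Fraction


def committed_law(target, draft):
    """Exact law of the token committed by one speculative accept-reject step.

    `target` and `draft` map each vocabulary item to its probability.
    """
    vocab = list(target)
    # Mass that is proposed by the draft and then accepted: min(pi, rho).
    accepted = {x: min(target[x], draft[x]) for x in vocab}
    reject_prob = 1 - sum(accepted.values())
    if reject_prob == 0:
        # Rejection never occurs, so the residual law is never used.
        return accepted
    # Residual law: normalized positive part of (pi - rho).
    residual = {
        x: max(target[x] - draft[x], Fraction(0)) / reject_prob for x in vocab
    }
    return {x: accepted[x] + reject_prob * residual[x] for x in vocab}


# Hypothetical target and draft laws on a three-token vocabulary.
# The draft assigns zero mass to "c", which exercises the rho(x) = 0 case.
pi = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
rho = {"a": Fraction(1, 4), "b": Fraction(3, 4), "c": Fraction(0)}

print(committed_law(pi, rho) == pi)
```

In this example the draft over-proposes `"b"` and never proposes `"c"`; the residual law places all of its mass on `"a"` and `"c"`, which is exactly what is needed to restore the target probabilities.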

Applying this argument sequentially along $j_1 < \cdots < j_{L_t}$, for any $x_{J_t} = (x^{j_1}, \dots, x^{j_{L_t}}) \in \mathcal{V}^{L_t}$,

$$
\begin{aligned}
\Pr\left(\hat{X}_{0,J_t} = x_{J_t} \mid X_t, t\right)
&= \prod_{m=1}^{L_t} \Pr\left(\hat{X}_0^{j_m} = x^{j_m} \mid X_t^{\geq j_m}, x_0^{<j_m}, t\right) \\
&= \prod_{m=1}^{L_t} \pi_\theta^{j_m}(x^{j_m}) \\
&= \prod_{m=1}^{L_t} p^{\star}\left(X_0^{j_m} = x^{j_m} \mid X_t^{\geq j_m}, x_0^{<j_m}, t\right) \\
&= p^{\star}(x_{J_t} \mid X_t, t).
\end{aligned}
$$

The last equality follows from the left-to-right factorization of the oracle joint law over the ordered set $J_t$. Hence the generated sequence has the desired oracle joint law. ∎

### D.3 Proof of Corollary [1](https://arxiv.org/html/2605.14305#S3.Ex17)

###### Proof.

Fix a resampling pass $s \geq 1$. Let

$$
R^{(s)} = \{r_1^{(s)}, \dots, r_{L_s}^{(s)}\}, \qquad r_1^{(s)} < \cdots < r_{L_s}^{(s)},
$$

be the positions selected for resampling. We condition on the previous sequence $\hat{X}_0^{(s-1)}$ and on the selected set $R^{(s)}$. Under this conditioning, all positions outside $R^{(s)}$ are fixed, and the resampling pass only updates positions in $R^{(s)}$.

By the assumption of the corollary, the target model on $R^{(s)}$ is given by the oracle next-position law. Therefore, the same one-step accept–reject argument used in the proof of Theorem 1 implies that each committed token has the corresponding oracle law given the already committed prefix and the fixed outside positions. Applying the chain rule along the order

$$
r_1^{(s)} < \cdots < r_{L_s}^{(s)},
$$

we obtain, for any $x_{R^{(s)}} \in \mathcal{V}^{L_s}$,

$$
\Pr\left(\hat{X}_{0,R^{(s)}}^{(s)} = x_{R^{(s)}} \mid \hat{X}_0^{(s-1)}, R^{(s)}, t\right) = p^{\star}\left(x_{R^{(s)}} \mid \hat{X}_0^{(s-1)}, R^{(s)}, t\right).
$$

Thus, conditional on the resampling history and the selected resampling set, the regeneration step has the oracle joint law on the updated positions. ∎

### D.4 Proof of Theorem [2](https://arxiv.org/html/2605.14305#S3.Ex18)

###### Proof.

In one speculative round, candidates are verified from left to right until the first rejection or until all $k$ candidates are accepted. Let $C_k$ denote the number of positions committed in this round. Since a rejected position is also committed after correction, the event $C_k > \ell$ is equivalent to accepting the first $\ell$ draft tokens.

Under the independent-acceptance approximation, each draft token is accepted with probability $\alpha$. Hence, for $\ell = 0, \dots, k-1$,

$$
\Pr(C_k > \ell) = \alpha^{\ell}.
$$

By the tail-sum formula for nonnegative integer-valued random variables,

$$
\mathbb{E}[C_k] = \sum_{\ell=0}^{k-1} \Pr(C_k > \ell) = \sum_{\ell=0}^{k-1} \alpha^{\ell}.
$$

Evaluating the geometric series gives

$$
\mathbb{E}[C_k] =
\begin{cases}
\dfrac{1 - \alpha^{k}}{1 - \alpha}, & 0 \leq \alpha < 1, \\[8pt]
k, & \alpha = 1.
\end{cases}
$$

∎
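The closed form for $\mathbb{E}[C_k]$ can be cross-checked against the explicit distribution of $C_k$ implied by the tail probabilities $\Pr(C_k > \ell) = \alpha^{\ell}$. The following is a minimal sketch under the same independent-acceptance approximation; the function names and the example values of `alpha` and `k` are illustrative assumptions.

```python
from fractions import Fraction


def expected_committed(alpha, k):
    """Closed form for E[C_k]: (1 - alpha^k) / (1 - alpha), or k when alpha = 1."""
    if alpha == 1:
        return k
    return (1 - alpha**k) / (1 - alpha)


def expected_committed_by_pmf(alpha, k):
    """Same expectation, computed from the explicit distribution of C_k."""
    # C_k = c < k: the first c - 1 drafts are accepted and the c-th is rejected
    # (the rejected position is still committed after correction).
    pmf = {c: alpha ** (c - 1) * (1 - alpha) for c in range(1, k)}
    # C_k = k: the first k - 1 drafts are accepted; the k-th position is
    # committed whether it is accepted or corrected.
    pmf[k] = alpha ** (k - 1)
    return sum(c * p for c, p in pmf.items())


# Hypothetical acceptance probability and draft length.
alpha, k = Fraction(4, 5), 4
print(expected_committed(alpha, k), expected_committed_by_pmf(alpha, k))
```

Both routes give the same value because the tail-sum formula and the direct expectation are two ways of summing the same distribution; at `alpha = 0` every round commits exactly one (corrected) position, and at `alpha = 1` it commits all `k`.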

### D.5 Proof of Corollary [2](https://arxiv.org/html/2605.14305#S3.Ex20)

###### Proof.

A speculative round consists of one draft proposal pass and one target verification pass, with wall-clock costs $c_\rho$ and $c_\pi$, respectively. Thus, its expected cost is $c_\rho + c_\pi$, and by Theorem 2 it commits $\mathbb{E}[C_k]$ positions in expectation.

The prefix-conditioned sequential target baseline commits one position per target pass, with cost $c_\pi$. Therefore, its expected cost for committing $\mathbb{E}[C_k]$ positions is $\mathbb{E}[C_k]\, c_\pi$. The idealized acceleration ratio is consequently

$$
S = \frac{\mathbb{E}[C_k]\, c_\pi}{c_\rho + c_\pi}.
$$

If $c_\rho \approx c_\pi$, then

$$
S = \frac{\mathbb{E}[C_k]}{2}.
$$

Substituting the expression for $\mathbb{E}[C_k]$ from Theorem 2 yields

$$
S =
\begin{cases}
\dfrac{1 - \alpha^{k}}{2(1 - \alpha)}, & 0 \leq \alpha < 1, \\[8pt]
\dfrac{k}{2}, & \alpha = 1.
\end{cases}
$$

∎
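To make the idealized ratio concrete, the sketch below evaluates $S = \mathbb{E}[C_k]\, c_\pi / (c_\rho + c_\pi)$ for given costs. It is only an illustration of the formula in this corollary; the function name `acceleration_ratio`, the parameter names `c_draft` and `c_target`, and the example inputs are assumptions, not measured quantities from the paper.

```python
from fractions import Fraction


def acceleration_ratio(alpha, k, c_draft=1, c_target=1):
    """Idealized speedup S = E[C_k] * c_target / (c_draft + c_target).

    `alpha` is the per-token acceptance probability, `k` the number of draft
    tokens per round, and the costs are per-pass wall-clock costs.
    """
    # E[C_k] from Theorem 2 under the independent-acceptance approximation.
    expected = k if alpha == 1 else (1 - alpha**k) / (1 - alpha)
    return expected * c_target / (c_draft + c_target)


# Hypothetical settings: equal costs, then a draft pass four times cheaper.
print(acceleration_ratio(Fraction(4, 5), 4))
print(acceleration_ratio(Fraction(4, 5), 4, c_draft=Fraction(1, 4), c_target=1))
```

With equal costs the ratio reduces to $\mathbb{E}[C_k]/2$, so under this idealized model a speculative round outperforms the sequential baseline only when it commits more than two positions in expectation; a cheaper draft pass lowers that threshold.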
