LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

arXiv cs.LG Papers

Summary

This paper introduces LEAP, a training-free method to accelerate inference in Diffusion Language Models (dLLMs) by detecting early-converging tokens, reducing denoising steps by 30% without losing accuracy.

arXiv:2605.10980v1 Announce Type: new Abstract: Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model precision. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.

Cached at: 05/13/26, 06:24 AM

# LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
Source: [https://arxiv.org/html/2605.10980](https://arxiv.org/html/2605.10980)
Haohui Zhang (Shanghai Jiao Tong University, zhanghaohui@sjtu.edu.cn), Zhiye Wang (Shanghai Jiao Tong University, 21-wzy@sjtu.edu.cn), Xiaoying Gan (Shanghai Jiao Tong University, ganxiaoying@sjtu.edu.cn), Xinbing Wang (Shanghai Jiao Tong University, xwang8@sjtu.edu.cn), Bo Jiang (Shanghai Jiao Tong University, bjiang@sjtu.edu.cn)

###### Abstract

Diffusion Language Models \(dLLMs\) have garnered significant attention for their potential in highly parallel processing\. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions\. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism\. Through systematic token\-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence\-based criteria are overly conservative\. In response, we introduce LEAP \(Lookahead Early\-Convergence Token Detection for Accelerated Parallel Decoding\)\. LEAP is a training\-free, plug\-and\-play method that leverages future context filtering and multi\-sequence superposition to detect early\-converging tokens\. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens\. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps\. Compared to confidence\-based decoding, the average number of denoising steps is reduced by about 30%\. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7\.2 tokens per step while preserving model precision\. LEAP effectively breaks the reliance on high\-confidence priors, offering a novel paradigm for parallel decoding\.

![Refer to caption](https://arxiv.org/html/2605.10980v1/x1.png)

Figure 1: Illustration of the ‘early convergence’ phenomenon in diffusion language models. The figure displays the denoising generation process from $T=0$ to $T=N$. In the confidence-based decoding strategy, only tokens with high confidence are decoded (marked in green). The red box highlights the token "generate," which is correctly predicted early at $T=1$ and remains stable throughout subsequent steps. However, it is not decoded until the final stages because of its low confidence score, demonstrating the limitation of confidence-based decoding.

## 1 Introduction

Autoregressive large language models (AR-LLMs) have long dominated the field of language modeling (Achiam et al., [2023](https://arxiv.org/html/2605.10980#bib.bib1); Yang et al., [2025](https://arxiv.org/html/2605.10980#bib.bib2); Liu et al., [2024](https://arxiv.org/html/2605.10980#bib.bib3)). However, their inherently sequential generation process restricts further improvements in inference speed. Consequently, diffusion large language models (dLLMs) have emerged as a promising new paradigm (Nie et al., [2025](https://arxiv.org/html/2605.10980#bib.bib4); Bie et al., [2025](https://arxiv.org/html/2605.10980#bib.bib7); Ye et al., [2025](https://arxiv.org/html/2605.10980#bib.bib5); Liu et al., [2025a](https://arxiv.org/html/2605.10980#bib.bib6)), garnering significant attention for their potential for highly parallel generation. Recent works have achieved inference speeds exceeding those of AR-LLMs by reducing per-step denoising costs and scaling the number of tokens processed simultaneously (Wang et al., [2025](https://arxiv.org/html/2605.10980#bib.bib8); Liu et al., [2025b](https://arxiv.org/html/2605.10980#bib.bib9), [a](https://arxiv.org/html/2605.10980#bib.bib6); Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)).

While the potential for high-speed generation is well established, the actual parallelism of current models is still relatively low. Existing parallel decoding methods generally select multiple high-confidence tokens using marginal probabilities (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)). Yet such parallelization assumes token independence. Decoding interdependent tokens independently often violates their semantic dependencies, resulting in performance drops, especially in reasoning scenarios requiring strict logical coherence. As a result, state-of-the-art models are restricted to decoding only a few tokens per step, limiting the scalability of parallel generation.

The fundamental reason for the precision degradation in high-parallelism scenarios is the discrepancy between independent sampling from marginal probabilities and sampling from the true joint distribution during parallel decoding. Parallel sampling that ignores dependencies can introduce significant bias. Existing methods mitigate this by decoding only high-confidence tokens (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)), which approximately satisfy the independence assumption. While confidence-based sampling targets high-confidence tokens to minimize inter-dependency issues, the practical scarcity of such tokens limits the effective degree of parallelism. Relaxing the confidence threshold induces sampling bias due to dependencies, resulting in significant accuracy degradation.

We attribute these limitations to the reliance on confidence metrics, which manifests in two primary aspects. First, confidence-based decoding methods maintain model accuracy by limiting the parallel candidates to only high-confidence tokens. However, our token-level statistical analysis reveals that numerous medium-confidence tokens already converge to their correct predictions early in the denoising process, indicating that high confidence is not a necessary condition for safe decoding. Second, high-confidence tokens inherently contribute less information, leading to a greater number of total decoding steps (Fu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib11)). Consequently, high-confidence decoding yields a low-information context for subsequent steps, further inhibiting the potential for parallelism. Therefore, a key challenge remains: how can we simultaneously expand parallelism and increase the information contributed per step, so as to minimize the number of iterations without sacrificing accuracy?

We introduce Lookahead Early-Convergence Detection for Accelerated Parallelism Decoding (LEAP), a training-free, plug-and-play parallel decoding strategy. We find empirically that a high proportion of medium-confidence tokens exhibit early correctness and convergence, which implies that a large number of forward steps merely repeat the same predictions for them. These tokens show low sensitivity: their predictions stabilize early and are robust to future contextual changes. Capitalizing on this, we propose a convergence detection strategy based on future context perturbation. By contrasting the predictions under the current context with those under a superposed context containing potential future information, we identify tokens that exhibit low sensitivity and high robustness to future updates, thereby enabling their early decoding. The strategy is made feasible by the future context candidate pruning and multi-sequence superimposed consistency detection that we propose. Decoding these medium-confidence tokens in advance not only enhances parallelism in the current step, but also triggers further token generation and amplifies parallelism in future steps, owing to their higher entropy and information contribution.

We conduct extensive evaluations on two popular open-source dLLMs, LLaDA and Dream, covering mathematics, code generation, and multi-disciplinary QA. Empirical results show that LEAP improves generation parallelism across all benchmarks, decreasing latency by around 30% relative to the confidence-based decoding strategy. Meanwhile, LEAP slightly improves average accuracy on LLaDA. Further analysis confirms that LEAP establishes a better Pareto frontier between speed and accuracy than the confidence-based decoding strategy.

![Refer to caption](https://arxiv.org/html/2605.10980v1/x2.png)
![Refer to caption](https://arxiv.org/html/2605.10980v1/x3.png)

Figure 2: (a) Confidence distribution of early decodable tokens for GSM8K with LLaDA-8B-Instruct. The red line denotes Early Correct, and the blue line denotes Early Correct & Converged. (b) Confidence distribution of ground-truth tokens at the preceding time step. The histogram and red curve represent the probability density and CDF, respectively. The annotation (Cum. $P=0.1$, $x\approx 0.32$) indicates that only 10% of tokens have confidence below 0.32.

![Refer to caption](https://arxiv.org/html/2605.10980v1/x4.png)

Figure 3: Overview of LEAP. Given a partially denoised sequence at step $t-1$, LEAP first performs future-context candidate pruning: for each masked position, only plausible future tokens whose confidence exceeds a loose threshold $\eta$ are retained. These candidates, together with copied mask tokens, are appended to the original sequence while preserving their original position IDs, forming a superposed context $x_t^{sup}$. This allows LEAP to compare predictions under the original context and the lookahead-perturbed context within a single forward pass. At step $t$, a token is decoded early only if its prediction remains consistent under the two contexts and its confidence exceeds $\tau$; otherwise, it stays masked for later refinement. In the example, (B) and (D) satisfy the consistency check and are unmasked early, whereas (A) is kept masked.

## 2 Related Work

Diffusion Language Models. Recently, diffusion-based large language models (dLLMs) have evolved into a distinct paradigm of high-performance foundation models, diverging from the standard autoregressive framework. Open-source efforts, notably the LLaDA series, have validated pure diffusion architectures trained from scratch; LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.10980#bib.bib4)) and LLaDA 2.0 (Bie et al., [2025](https://arxiv.org/html/2605.10980#bib.bib7)) leverage masked prediction to match autoregressive baselines (e.g., LLaMA 3 (Dubey et al., [2024](https://arxiv.org/html/2605.10980#bib.bib12))) at the 8B scale and confirm scaling laws via a 100B-parameter MoE variant. Dream improves the downstream task performance of dLLMs while maintaining parallelism through autoregressive model weight initialization. In the commercial domain, dLLMs are increasingly prominent. Commercial models such as Google DeepMind’s Gemini Diffusion, Inception Labs’ Mercury, and ByteDance’s Seed Diffusion (Song et al., [2025](https://arxiv.org/html/2605.10980#bib.bib13)) demonstrate superior inference speeds, highlighting their utility in latency-critical applications.

Acceleration of dLLMs. Although dLLMs exhibit considerable potential for efficient generation, they remain constrained by a speed-accuracy trade-off. Recent research addresses this challenge through two primary strategies: reducing the cost per denoising step and increasing the decoding parallelism. The first strategy mainly addresses the fact that the traditional KV-Cache is not applicable to dLLMs. Works such as Fast-dLLM-Cache (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)) and dKV-Cache (Ma et al., [2025](https://arxiv.org/html/2605.10980#bib.bib14)) have introduced approximate caching for bidirectional attention, while Refusion (Li et al., [2025a](https://arxiv.org/html/2605.10980#bib.bib19)) and WeDLM (Liu et al., [2025a](https://arxiv.org/html/2605.10980#bib.bib6)) adapt to the KV-Cache through hybrid attention. The second strategy focuses on maximizing the number of unmasked tokens per step without degrading precision. Fast-dLLM-Parallel (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)) and EB-Sampler (Ben-Hamu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib16)) select tokens with low joint dependency based on confidence and entropy, respectively. D2F (Wang et al., [2025](https://arxiv.org/html/2605.10980#bib.bib8)) achieves inter-block parallel decoding through distillation. dParallel (Chen et al., [2025](https://arxiv.org/html/2605.10980#bib.bib17)) leverages deterministic information as a training signal to enhance overall model confidence, thereby accelerating parallel sampling. Prophet (Li et al., [2025b](https://arxiv.org/html/2605.10980#bib.bib20)) focuses on global convergence and uses the confidence gap for early committing. KLASS (Kim et al., [2025](https://arxiv.org/html/2605.10980#bib.bib28)) introduces token-level KL divergence between consecutive timesteps as a stability criterion, unmasking tokens only when both high confidence and low KL are satisfied, thereby reducing premature decoding errors. LoPA (Xu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib29)) samples multiple candidate branches and selects the one with the highest future branch confidence. Despite diverse efforts to accelerate dLLMs, existing paradigms still predominantly rely on parallel decoding schemes characterized by high confidence and low entropy. This imposes a severe bottleneck on the decoding budget (specifically, the token count) at each step. Unlike previous methods, we employ a training-free mechanism to detect early-converged tokens with medium confidence, thereby enabling higher parallelism without incurring the performance penalty associated with lower confidence thresholds.

## 3 Methodology

### 3.1 Preliminary

#### 3.1.1 Diffusion Language Models

Diffusion Language Models (dLLMs) model text generation as a discrete diffusion process involving a forward corruption phase and a reverse denoising phase. The forward process corrupts a clean sequence $\mathbf{x}_0$ into a masked state $\mathbf{x}_t$ at timestep $t$. It can be formulated as a marginal distribution in which each token is independently masked with probability $1-\alpha_t$:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \prod_{i=1}^{L} \left[ \alpha_t \, \mathbb{I}(x_t^i = x_0^i) + (1-\alpha_t) \, \mathbb{I}(x_t^i = \texttt{[M]}) \right], \tag{1}$$

where $\texttt{[M]}$ denotes the mask token. As $t$ increases, the signal-to-noise ratio $\alpha_t$ decreases monotonically. The reverse process learns to reconstruct the original data from the corrupted state. A neural network $p_\theta$ is trained to predict the original tokens for all masked positions simultaneously. The learning objective minimizes the negative log-likelihood over the masked indices $M_t$:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t} \left[ -\sum_{i \in M_t} \log p_\theta(x_0^i \mid \mathbf{x}_t) \right]. \tag{2}$$
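The forward corruption in Eq. (1) and the masked loss in Eq. (2) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `"[M]"` string, and the list-based token representation are hypothetical, and the log-probabilities are supplied by hand rather than by a model.

```python
import random

MASK = "[M]"  # mask token string; the literal is an illustrative assumption


def forward_mask(x0, alpha_t, rng):
    """Sample x_t from q(x_t | x_0) in Eq. (1).

    Each token is kept with probability alpha_t and replaced by MASK
    with probability 1 - alpha_t, independently across positions.
    """
    return [tok if rng.random() < alpha_t else MASK for tok in x0]


def masked_nll(true_token_log_probs, x_t):
    """Single-sample estimate of the loss in Eq. (2).

    true_token_log_probs[i] stands for log p_theta(x_0^i | x_t); only
    positions that are masked in x_t contribute to the sum.
    """
    return -sum(lp for lp, tok in zip(true_token_log_probs, x_t) if tok == MASK)


rng = random.Random(0)
x0 = ["LEAP", "decodes", "tokens", "early"]
x_t = forward_mask(x0, alpha_t=0.5, rng=rng)  # each token masked with probability 0.5
```

With `alpha_t = 1.0` nothing is masked and with `alpha_t = 0.0` every position becomes `[M]`, which matches the two extremes of the noise schedule.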

#### 3.1.2 Confidence-Based Parallel Decoding

To accelerate inference, Fast-dLLM (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)) employs a Confidence-Based Parallel Decoding (CBPD) strategy that iteratively fixes tokens based on predictive certainty. At each step $t$, given the current state $\mathbf{x}_t$, the model predicts the probability distribution over the vocabulary for all masked positions. CBPD identifies the set of high-confidence positions, denoted as $\mathcal{S}_t$, where the model’s top prediction probability exceeds a scalar threshold $\phi$:

$$\mathcal{S}_t = \left\{ i \in M_t \;\middle|\; \max_{v} p_\theta(x_i = v \mid \mathbf{x}_t) > \phi \right\}. \tag{3}$$

The state is then updated by unmasking these high-confidence tokens with their greedy predictions, while keeping uncertain positions masked for future refinement:

$$x_{t-1}^i = \begin{cases} \operatorname*{arg\,max}_{v} p_\theta(x_i = v \mid \mathbf{x}_t) & \text{if } i \in \mathcal{S}_t, \\ \texttt{[M]} & \text{otherwise.} \end{cases} \tag{4}$$

If $\mathcal{S}_t$ is empty, CBPD enforces an update on the single position with the highest global confidence to guarantee convergence.
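A minimal sketch of one CBPD update, following Eqs. (3) and (4) and the empty-set fallback described above. The dictionary-based interface, the function name `cbpd_step`, and the toy probabilities are assumptions made for illustration; this is not the Fast-dLLM implementation.

```python
MASK = None  # stand-in for the [M] token


def cbpd_step(tokens, probs, phi=0.9):
    """Apply one CBPD update (Eqs. 3-4) and return the new token list.

    tokens: list of tokens, with MASK at undecoded positions.
    probs:  {position: {token: probability}} for every masked position.
    phi:    confidence threshold.
    """
    masked = [i for i, tok in enumerate(tokens) if tok is MASK]
    if not masked:
        return list(tokens)
    # Greedy prediction and its confidence at each masked position.
    best = {i: max(probs[i].items(), key=lambda kv: kv[1]) for i in masked}
    selected = [i for i in masked if best[i][1] > phi]
    if not selected:
        # Fallback: decode the single most confident position.
        selected = [max(masked, key=lambda i: best[i][1])]
    updated = list(tokens)
    for i in selected:
        updated[i] = best[i][0]
    return updated


tokens = ["The", MASK, MASK]
probs = {1: {"cat": 0.95, "dog": 0.05}, 2: {"sat": 0.60, "ran": 0.40}}
step1 = cbpd_step(tokens, probs)  # only position 1 clears phi = 0.9
```

In this toy input, position 2 has a greedy prediction with confidence 0.60, so it stays masked even if that prediction never changes in later steps.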

### 3.2 Barriers to Parallel Decoding

Table 1: Performance of LLaDA-8B-Instruct with various confidence decoding thresholds.

CBPD selects a subset of tokens whose confidence scores exceed a specific threshold during each denoising step. The theoretical underpinning of this strategy rests on the independence assumption under high confidence: when the model is sufficiently confident, sampling from the marginal probabilities approximates sampling from the joint distribution (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)).

However, the strict confidence threshold in CBPD creates a bottleneck for the parallel decoding candidate set. In practice, we identify a large volume of ‘early correct’ tokens: predictions that match the final high-confidence output before the threshold is met. As shown by the red line in Fig. [2](https://arxiv.org/html/2605.10980#S1.F2)(a), more than half of the tokens in the medium confidence range ($[0.5, 0.9)$) are early correct. Furthermore, we observe ‘early correct and converged’ tokens, which maintain stable and correct predictions throughout the pre-threshold phase. Fig. [2](https://arxiv.org/html/2605.10980#S1.F2)(a) also illustrates the gap between these two categories across different confidence levels. At higher confidence levels ($\geq 0.6$), the curves exhibit a minimal gap, suggesting a high degree of overlap between these token subsets. This indicates that CBPD unnecessarily discards valid candidates. Naively reducing the threshold is ineffective, as it violates the high-confidence independence assumption and results in a substantial deviation between the joint and marginal distributions (Table [1](https://arxiv.org/html/2605.10980#S3.T1)).

### 3.3 Problem Formulation

Formally, let $\tau_i$ denote the specific decoding step where the $i$-th token is decoded in CBPD. The prediction for this token is obtained by selecting the candidate with the highest probability, conditioned on the context $\mathbf{x}^{\tau_i-1}$ available at that step:

$$\pi(i \mid \mathbf{x}^{\tau_i-1}) := x_i^{*} = \operatorname*{argmax}_{v \in \mathcal{V}} \, p_\theta\left(x_i = v \mid \mathbf{x}^{\tau_i-1}\right), \tag{5}$$

where the function $\pi$ represents greedy decoding. We take $x_i^{*}$ obtained from the $\tau_i$-step prediction as the final convergence target. To maximize parallelism, iteration $t$ should decode all tokens that align with the final converged outcome. Formally, we define the maximal decodable set at step $t$ as:

$$\mathcal{D}_t^{*} = \left\{ i \in M_t \;\middle|\; \pi(i \mid \mathbf{x}^{t-1}) = x_i^{*} \right\}, \tag{6}$$

where $M_t$ is the set of candidate mask tokens in step $t$.

However, the direct computation of $x_i^{*}$ at step $t$ is precluded by the unavailability of the future context $\mathbf{x}^{\tau_i-1}$. Thus, the challenge lies in maximizing the decodable set $\mathcal{D}_t$ at step $t$ while introducing minimal deviation from the original converged output.
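To make Eq. (6) concrete, the sketch below computes the maximal decodable set in hindsight, once the final tokens are known. It is an oracle for analysis only, since those final tokens are exactly what is unavailable at step $t$; the function name and the toy predictions are hypothetical.

```python
def oracle_decodable_set(masked_positions, greedy_now, final_tokens):
    """Eq. (6): masked positions whose current greedy prediction already
    equals the token x_i^* that CBPD eventually commits."""
    return {i for i in masked_positions if greedy_now[i] == final_tokens[i]}


masked_positions = [1, 2, 3]
greedy_now = {1: "generate", 2: "a", 3: "cat"}      # predictions at step t
final_tokens = {1: "generate", 2: "the", 3: "cat"}  # known only in hindsight
oracle = oracle_decodable_set(masked_positions, greedy_now, final_tokens)
```

Here positions 1 and 3 belong to the oracle set; a practical detector has to approximate this set without access to `final_tokens`.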

### 3.4 Lookahead Early-Convergence Token Detection

We propose LEAP, a parallel accelerated decoding strategy that identifies early-converged tokens to unlock higher decoding parallelism. Fig. [3](https://arxiv.org/html/2605.10980#S1.F3) shows an overview of the method.

Early convergence signifies consistency between current and future prediction outcomes. Ideally, we expect converged tokens to demonstrate perfect robustness; specifically, given the current context, the result should remain unaffected by the inclusion of any further context. We formally define a token at index $i$ as an ideally converged token at step $t$ if it exhibits prediction invariance with respect to future context updates. Let $\Omega(\mathbf{x}^{t-1})$ denote the set of all valid future contexts reachable from the current state $\mathbf{x}^{t-1}$. An ideally converged token $i$ should satisfy:

$$\forall \mathbf{c} \in \Omega(\mathbf{x}^{t-1}), \quad \pi(i \mid \mathbf{c}) = \pi(i \mid \mathbf{x}^{t-1}). \tag{7}$$

In practice, detecting early-converged tokens yields a net speedup only if the computational overhead it introduces remains negligible relative to the resulting efficiency gains. To this end, we propose a novel strategy that combines future-context pruning with superimposed decoding. This approach minimizes the overhead of lookahead detection through two key optimizations: (1) future-context candidate pruning, which leverages historical information to filter a superset of potential new contexts for the subsequent timestep, and (2) multi-sequence superimposed consistency detection, which enables lookahead consistency checks without necessitating additional forward passes.

Future Context Candidate Pruning. Motivated by the ‘early correct’ observation in Section [3.2](https://arxiv.org/html/2605.10980#S3.SS2), we posit that the correct answer often appears among the top candidates with non-negligible confidence, even when greedy decoding fails. To verify this, we tracked the confidence of ground-truth tokens at the preceding time step. Fig. [2](https://arxiv.org/html/2605.10980#S1.F2)(b) confirms that over 98% of such tokens maintained a confidence of at least 0.3. This finding suggests a look-ahead pre-filtering strategy, where a candidate superset for the subsequent step is selected using a lower threshold. We define $\eta$ as the minimum confidence threshold and proactively filter the candidate set for step $t$ after the forward pass at step $t-1$:

$$\mathcal{S}_i^t = \left\{ v \in \mathcal{V} \mid p_\theta(x_i = v \mid \mathbf{x}^{t-2}) \geq \eta \right\}. \tag{8}$$
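The pruning rule in Eq. (8) amounts to a per-position threshold filter. The sketch below assumes a hypothetical dictionary of per-position token probabilities taken from the previous forward pass; the function name and the toy distributions (which list only the top few tokens) are illustrative.

```python
def prune_future_candidates(probs, eta=0.2):
    """Eq. (8): keep, per masked position, the tokens with probability >= eta.

    probs: {position: {token: probability}} from the previous forward pass.
    Returns {position: [tokens sorted by descending probability]}.
    """
    pruned = {}
    for pos, dist in probs.items():
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        pruned[pos] = [tok for tok, p in ranked if p >= eta]
    return pruned


probs = {
    3: {"cat": 0.55, "dog": 0.30, "cow": 0.15},
    5: {"sat": 0.12, "ran": 0.10, "ate": 0.08},
}
candidates = prune_future_candidates(probs, eta=0.2)
```

Because the probabilities at a position sum to one, at most $\lfloor 1/\eta \rfloor$ tokens can pass the filter, so with $\eta = 0.2$ each masked position contributes no more than five candidates. This bounds how much the appended context can grow.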
![Refer to caption](https://arxiv.org/html/2605.10980v1/x5.png)

Figure 4: Attention mask for isolating sequence.

Multi-Sequence Superimposed Consistency Detection. Early convergence indicates that a token remains stable given new context. Treating emerging context as a perturbation, we seek tokens that are robust to potential future variations. A naive approach would be to compute predictions conditioned on all possible future contexts and select tokens that maintain consistency across these predictions, but the search space becomes intractable due to the combinatorial explosion of candidates. To address this, we propose a Multi-Sequence Context Superposition strategy. As shown in Fig. [3](https://arxiv.org/html/2605.10980#S1.F3), we approximate future perturbations by appending all potential candidate tokens to the current sequence. Additionally, we append copies of all masked tokens to facilitate simultaneous prediction under both original and perturbed contexts in a single forward pass. All appended tokens retain their original positional encoding so that the logical order of the context is preserved. By leveraging the attention mask, these copies observe the full augmented context, excluding the appended tokens at their corresponding positions, since these mask copies represent tokens that have not yet been decoded. Meanwhile, the original sequence is isolated from the appended tokens, ensuring independent inference pathways. The detailed attention mask design that achieves this isolation is illustrated in Fig. [4](https://arxiv.org/html/2605.10980#S3.F4). This yields predictions based on the original sequence alongside those subject to future perturbations. We identify early-converged tokens as those maintaining consistency under perturbation. Formally, the superimposed sequence in step $t$ is defined as:

$$x^{t}_{\text{sup}} = x^{t} \oplus \bigoplus_{i \in M_t} \left( [x_i^t] \oplus \mathcal{S}^{t}_{i} \right), \tag{9}$$

where $\oplus$ denotes concatenation. Then we identify tokens that remain consistent post-perturbation as early-converged tokens, defined as:

$$\mathcal{D}_t = \left\{ i \in M_t \mid \hat{y}_i = \pi(i \mid x^{t}_{\text{sup}}) \right\}, \tag{10}$$

where $\hat{y}_i = \pi(i \mid x^{t-1})$ denotes the current prediction of the $i$-th token. However, considering that practical scenarios often involve tokens with invalid predictions (characterized by high entropy and low contextual shift), we adopt a trade-off strategy by adding a loose confidence threshold $\tau$ to Eq. (3). The set of tokens selected for early decoding is formalized as:

$$\hat{\mathcal{D}}_t = \left\{ i \in \mathcal{D}_t \mid p_\theta(x_i = \hat{y}_i \mid x^{t-1}) \geq \tau \right\}. \tag{11}$$
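The layout of the superposed sequence in Eq. (9) and the selection rule of Eqs. (10) and (11) can be sketched as follows. Several details are assumptions rather than the authors' implementation: the text specifies that the original sequence is isolated from appended tokens and that each mask copy does not see the appended tokens at its own position, but it does not say what candidate tokens attend to or whether copies see each other. Here candidates attend to the original sequence and themselves, and copies do not attend to other copies. The function names and the model outputs in the example are hypothetical.

```python
def build_superposed(tokens, candidates, mask_token="[M]"):
    """Lay out the superposed sequence of Eq. (9) with a boolean attention mask.

    tokens:     current sequence, with mask_token at undecoded positions.
    candidates: {masked position: [candidate tokens]} from the pruning step.
    Returns (sup_tokens, position_ids, kinds, allow), where kinds[j] is
    ("orig" | "copy" | "cand", position) and allow[q][k] says whether
    query q may attend to key k.
    """
    sup = list(tokens)
    pos_ids = list(range(len(tokens)))
    kinds = [("orig", i) for i in range(len(tokens))]
    for i in sorted(candidates):
        # Mask copy first, then its candidates; all reuse position id i.
        sup.append(mask_token)
        pos_ids.append(i)
        kinds.append(("copy", i))
        for tok in candidates[i]:
            sup.append(tok)
            pos_ids.append(i)
            kinds.append(("cand", i))
    n = len(sup)
    allow = [[False] * n for _ in range(n)]
    for q, (q_kind, q_pos) in enumerate(kinds):
        for k, (k_kind, k_pos) in enumerate(kinds):
            if k_kind == "orig" or q == k:
                # Every token sees the original sequence and itself.
                allow[q][k] = True
            elif q_kind == "copy" and k_kind == "cand":
                # Copies see candidates of other positions only.
                allow[q][k] = k_pos != q_pos
            # Otherwise False: original tokens never see appended tokens.
    return sup, pos_ids, kinds, allow


def leap_select(pred_orig, conf_orig, pred_sup, tau=0.7):
    """Eqs. (10)-(11): positions whose greedy prediction is unchanged under
    the superposed context and whose original-context confidence >= tau."""
    consistent = {i for i in pred_orig if pred_sup.get(i) == pred_orig[i]}
    return {i for i in consistent if conf_orig[i] >= tau}


tokens = ["A", "[M]", "[M]"]
sup, pos_ids, kinds, allow = build_superposed(tokens, {1: ["x", "y"], 2: ["z"]})

# Hypothetical model outputs for masked positions 1 and 2.
pred_orig = {1: "x", 2: "z"}    # greedy tokens read from the original rows
conf_orig = {1: 0.75, 2: 0.55}  # their probabilities
pred_sup = {1: "x", 2: "z"}     # greedy tokens read from the mask-copy rows
early = leap_select(pred_orig, conf_orig, pred_sup)
```

In this toy run both positions are consistent under perturbation, but only position 1 also clears $\tau = 0.7$, so it alone is decoded early. In a real model the `allow` matrix would be converted to the attention-mask format the model expects and `pos_ids` would be passed as explicit position IDs.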
Table 2: Main results on five benchmarks under two base models (LLaDA = LLaDA-8B-Instruct, Dream = Dream-7B-Instruct). “–” indicates that the method does not provide an implementation for the corresponding model.

| Dataset | Method | LLaDA Acc ↑ | LLaDA TPS ↑ | LLaDA Steps ↓ | LLaDA Spd(Lat.) ↑ | Dream Acc ↑ | Dream TPS ↑ | Dream Steps ↓ | Dream Spd(Lat.) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K (4-shot) | Baseline | 76.3 | 8.4 | 256 | 1.0× | 78.2 | 5.7 | 256 | 1.0× |
| GSM8K (4-shot) | LoPA | – | – | – | – | 75.6 | 12.5 | 73 | 2.2× |
| GSM8K (4-shot) | KLASS | 78.8 | 15.2 | 115 | 1.8× | 76.1 | 12.3 | 113 | 2.2× |
| GSM8K (4-shot) | Conf-Based | 76.5 | 27.2 | 79 | 3.2× | 77.6 | 24.7 | 60 | 4.2× |
| GSM8K (4-shot) | LEAP | 78.0 | 34.3 | 58 | 4.0× | 76.9 | 31.3 | 44 | 5.5× |
| HumanEval (0-shot) | Baseline | 42.3 | 10.8 | 512 | 1.0× | 56.7 | 3.4 | 512 | 1.0× |
| HumanEval (0-shot) | LoPA | – | – | – | – | 52.4 | 12.4 | 76 | 4.0× |
| HumanEval (0-shot) | KLASS | 42.7 | 20.1 | 219 | 1.9× | 55.5 | 12.7 | 94 | 4.9× |
| HumanEval (0-shot) | Conf-Based | 42.1 | 30.2 | 176 | 2.8× | 54.3 | 26.0 | 65 | 7.5× |
| HumanEval (0-shot) | LEAP | 42.7 | 36.5 | 125 | 3.3× | 54.3 | 33.9 | 50 | 8.8× |
| MBPP (3-shot) | Baseline | 37.0 | 1.0 | 512 | 1.0× | 55.0 | 1.0 | 512 | 1.0× |
| MBPP (3-shot) | LoPA | – | – | – | – | 54.6 | 10.0 | 34 | 10.1× |
| MBPP (3-shot) | KLASS | 40.2 | 15.2 | 67 | 7.6× | 59.6 | 9.2 | 54 | 9.1× |
| MBPP (3-shot) | Conf-Based | 36.2 | 11.9 | 45 | 11.3× | 54.8 | 17.8 | 27 | 18.1× |
| MBPP (3-shot) | LEAP | 37.0 | 21.6 | 22 | 20.3× | 53.0 | 25.2 | 18 | 25.0× |
| MATH (4-shot) | Baseline | 33.4 | 8.9 | 256 | 1.0× | 42.6 | 8.4 | 256 | 1.0× |
| MATH (4-shot) | LoPA | – | – | – | – | 41.7 | 11.2 | 107 | 1.3× |
| MATH (4-shot) | KLASS | 32.3 | 14.7 | 136 | 1.5× | 37.7 | 13.6 | 132 | 1.9× |
| MATH (4-shot) | Conf-Based | 33.1 | 22.3 | 101 | 2.5× | 42.1 | 24.1 | 88 | 2.9× |
| MATH (4-shot) | LEAP | 32.4 | 27.6 | 73 | 3.1× | 40.0 | 31.5 | 63 | 3.7× |
| GPQA (5-shot) | Baseline | 30.6 | 4.0 | 256 | 1.0× | 33.3 | 0.5 | 256 | 1.0× |
| GPQA (5-shot) | LoPA | – | – | – | – | 30.6 | 7.6 | 6 | 20.7× |
| GPQA (5-shot) | KLASS | 29.0 | 10.1 | 124 | 2.1× | 32.6 | 11.6 | 8 | 24.8× |
| GPQA (5-shot) | Conf-Based | 31.9 | 8.3 | 28 | 9.1× | 32.1 | 22.3 | 5 | 49.6× |
| GPQA (5-shot) | LEAP | 32.1 | 12.1 | 17 | 13.3× | 32.4 | 22.0 | 4 | 62.0× |

## 4 Experiment

### 4.1 Experimental Setup

Benchmarks. We evaluate LEAP on representative tasks across diverse domains: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.10980#bib.bib22)), MATH (Lewkowycz et al., [2022](https://arxiv.org/html/2605.10980#bib.bib27)), and GPQA (Rein et al., [2024](https://arxiv.org/html/2605.10980#bib.bib23)) for science and math reasoning; HumanEval (Chen, [2021](https://arxiv.org/html/2605.10980#bib.bib24)) and MBPP (Austin et al., [2021](https://arxiv.org/html/2605.10980#bib.bib26)) for code generation. We report four metrics: accuracy (Acc), tokens per second (TPS), the number of denoising steps (Steps), and wall-clock latency speedup relative to the full-step baseline (Spd(Lat.)). Furthermore, to demonstrate that our method is orthogonal to model-centric acceleration techniques, we integrate it with dParallel, a distillation-based method designed to enhance dLLM parallelism.

Baselines. We compare LEAP against three decoding strategies: (1) Conf-Based, confidence threshold-based parallel decoding (Wu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib10)) with the threshold fixed at 0.9, (2) KLASS (Kim et al., [2025](https://arxiv.org/html/2605.10980#bib.bib28)), and (3) LoPA (Xu et al., [2025](https://arxiv.org/html/2605.10980#bib.bib29)).

Implementation Details. Our experiments are conducted on two open-source dLLMs: LLaDA-8B-Instruct and Dream-7B-Instruct. Both models utilize a semi-autoregressive generation strategy with a block size of 32. Unless otherwise specified, the hyperparameters are set to $\eta=0.2$ and $\tau=0.7$. All experiments are performed on a single NVIDIA 5090 (32GB) GPU.

Table 3: The combination of LEAP with dParallel on the LLaDA-8B-Instruct model.

### 4.2 Main Results

Results on LLaDA and Dream. As shown in Table [2](https://arxiv.org/html/2605.10980#S3.T2), LEAP achieves the highest decoding speed on most benchmarks for both base models. On LLaDA-8B-Instruct, LEAP attains an average speedup of $6.7\times$ over the baseline, substantially outperforming confidence-based decoding ($4.9\times$) and KLASS ($1.8\times$). Meanwhile, LEAP maintains competitive or superior accuracy: average accuracy rises from 47.2% (baseline) and 47.4% (Conf-Based) to 47.8%, suggesting that early decoding of converged tokens positively influences subsequent generation. On Dream-7B-Instruct, a similar trend is observed: LEAP achieves an average speedup of $10.6\times$, clearly surpassing confidence-based decoding ($8.3\times$), KLASS ($4.6\times$), and LoPA ($4.0\times$), while preserving accuracy comparable to confidence-based decoding. Across both models, the speedup gains stem from our early-convergence detection strategy, which expands the decodable set per step and substantially reduces the total denoising iterations.

![Refer to caption](https://arxiv.org/html/2605.10980v1/x6.png)

Figure 5: Performance vs\. Efficiency Trade\-off on HumanEval\.

Performance vs\. Efficiency Analysis\. To further examine the practical efficiency of LEAP on code generation, we report the trade\-off between HumanEval Pass@1 and computational cost, measured by TFLOPS, in Fig\. [5](https://arxiv.org/html/2605.10980#S4.F5)\. As shown in the figure, in the usable Pass@1 range, LEAP achieves comparable or higher accuracy with lower TFLOPS than the confidence\-based baseline\. This suggests that LEAP reaches effective code generation performance with less computation, reducing redundant inference cost while preserving task performance\. The shaded region highlights this efficiency advantage over confidence\-based decoding\.

Results on dParallel\. We further integrated our method into LLaDA\-8B\-Instruct distilled by dParallel\. By incorporating confidence signals into the distillation process, dParallel accelerates token confidence convergence, thereby mitigating parallelism bottlenecks\. However, dParallel remains unable to leverage early\-converging tokens efficiently\. We evaluated the integration of LEAP and dParallel on mathematical reasoning and code generation tasks, which typically require long\-sequence generation\. As shown in Table [3](https://arxiv.org/html/2605.10980#S4.T3), our method significantly expands parallelism while maintaining comparable accuracy, increasing TPF from 3\.7 to 5\.4, a 46% increase\. This gain is consistent with that observed in the standard LLaDA model, demonstrating that LEAP is orthogonal to dParallel and further validating its generalizability\.
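
The reported relative gain follows directly from the two TPF values:

```latex
\[
\frac{5.4 - 3.7}{3.7} = \frac{1.7}{3.7} \approx 0.46 \quad (\text{about } 46\%)
\]
```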

![Refer to caption](https://arxiv.org/html/2605.10980v1/x7.png) \(a\) HumanEval
![Refer to caption](https://arxiv.org/html/2605.10980v1/x8.png) \(b\) GSM8K
![Refer to caption](https://arxiv.org/html/2605.10980v1/x9.png) \(c\) HumanEval
![Refer to caption](https://arxiv.org/html/2605.10980v1/x10.png) \(d\) GSM8K

Figure 6: \(a\-b\) Impact of threshold $\tau$ on accuracy and NFE\. \(c\-d\) Impact of threshold $\eta$ on accuracy and normalized TFOPs \(Token Forward Operations\), where TFOPs are normalized against the confidence\-based decoding scheme\.

![Refer to caption](https://arxiv.org/html/2605.10980v1/x11.png)

Figure 7: Per\-step overhead analysis of LEAP on LLaDA\-8B\-Instruct\.

### 4\.3 Hyperparameter Analysis

Impact of threshold $\tau$\. We analyze the impact of the threshold $\tau$ on LLaDA\-8B\-Instruct by evaluating its performance on HumanEval and GSM8K, varying $\tau$ from 0\.55 to 0\.8 in steps of 0\.05\. As illustrated in Fig\. [6\(a\)](https://arxiv.org/html/2605.10980#S4.F6.sf1) and Fig\. [6\(b\)](https://arxiv.org/html/2605.10980#S4.F6.sf2), the computational overhead, measured as the number of denoising steps, increases linearly with the threshold\. Regarding accuracy, HumanEval performance drops significantly when $\tau$ falls below 0\.6 and remains relatively stable above 0\.65\. GSM8K accuracy, by contrast, follows a unimodal curve that peaks at $\tau = 0.7$\. These results suggest that an excessively low threshold should be avoided, as it tends to select tokens distant from the context, which often manifest as high entropy and nonsensical repetitions\. Consequently, we set $\tau = 0.7$ to balance efficiency and accuracy\.

Impact of threshold $\eta$\. Given that the threshold $\eta$ determines the number of tokens computed, we introduce the metric of Token Forward Operations \(TFOPs\)\. One TFOP represents the computational load of processing a single token through a single forward pass\. To measure the total computational cost of inference, we sum the TFOPs across all timesteps\. We assess the LLaDA\-8B\-Instruct model on GSM8K and HumanEval with $\eta \in [0.1, 0.5]$ \(step size 0\.05\), normalizing computation against confidence\-based decoding\. As shown in Fig\. [6\(c\)](https://arxiv.org/html/2605.10980#S4.F6.sf3) and Fig\. [6\(d\)](https://arxiv.org/html/2605.10980#S4.F6.sf4), computational cost rises slowly as $\eta$ decreases, while model accuracy remains stable\. Crucially, LEAP maintains a 20%–30% efficiency gain \(in TFOPs\) over the confidence\-based decoding baseline due to fewer total timesteps\. Consequently, we select $\eta = 0.2$ to balance efficiency and accuracy\.
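
The TFOPs metric described above can be sketched as a sum of per\-pass token counts, normalized by the same sum for a baseline\. The function names and the example counts below are hypothetical and are not measurements from the paper\.

```python
def total_tfops(tokens_per_pass):
    """Sum TFOPs over all forward passes (one TFOP = one token through one pass)."""
    return sum(tokens_per_pass)


def normalized_tfops(method_passes, baseline_passes):
    """TFOPs of a method relative to a baseline; values below 1.0 mean less compute."""
    return total_tfops(method_passes) / total_tfops(baseline_passes)


# Hypothetical counts, chosen only to illustrate the trade-off:
baseline = [32] * 10   # 10 steps, 32 tokens per pass
candidate = [40] * 6   # 6 steps, 40 tokens per pass (longer superimposed input)
ratio = normalized_tfops(candidate, baseline)
print(ratio)  # 0.75
```

In this toy setting the candidate processes more tokens per pass but needs fewer passes, so its normalized TFOPs fall below 1\.0\.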

### 4\.4 Per\-Step Overhead Analysis

A natural concern regarding LEAP is the per\-step overhead from the superimposed sequence\. The additional length is bounded by two factors: the pruning threshold $\eta$ restricts the candidates per position, and $|M_t|$ diminishes as decoding progresses\. As shown in Fig\. [7](https://arxiv.org/html/2605.10980#S4.F7), even under aggressive settings, LEAP introduces only moderate per\-step increases in both token count and latency\. Crucially, the TFOPs analysis \(Fig\. [6\(c\)](https://arxiv.org/html/2605.10980#S4.F6.sf3) and Fig\. [6\(d\)](https://arxiv.org/html/2605.10980#S4.F6.sf4)\) confirms that the total computation remains 20%–30% lower than confidence\-based decoding, as the reduction in denoising steps more than compensates for the per\-step overhead\. This validates that the superimposed consistency detection is a worthwhile investment, trading a bounded per\-step cost for global acceleration\.
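
One way to make this trade\-off explicit, under the simplifying assumption that each forward pass of confidence\-based decoding processes a fixed number of tokens $n$: let $T_{\text{conf}}$ and $T_{\text{LEAP}}$ be the step counts and $\Delta_t$ the extra superimposed tokens at step $t$\. These symbols are introduced here for illustration and do not appear in the original\.

```latex
\[
C_{\text{conf}} = n\,T_{\text{conf}}, \qquad
C_{\text{LEAP}} = \sum_{t=1}^{T_{\text{LEAP}}} (n + \Delta_t)
               = T_{\text{LEAP}}\,(n + \bar{\Delta}),
\qquad
\bar{\Delta} = \frac{1}{T_{\text{LEAP}}} \sum_{t=1}^{T_{\text{LEAP}}} \Delta_t ,
\]
\[
C_{\text{LEAP}} < C_{\text{conf}}
\iff
\frac{T_{\text{LEAP}}}{T_{\text{conf}}} < \frac{n}{n + \bar{\Delta}} .
\]
```

Under this model, LEAP lowers total TFOPs whenever the relative reduction in steps outweighs the average relative growth in per\-step length\.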

## 5 Conclusion

In this paper, we introduce Lookahead Early\-Convergence Token Detection for Accelerated Parallel Decoding \(LEAP\)\. By integrating future\-context candidate pruning with multi\-sequence superimposed consistency detection, LEAP detects and decodes early\-converged tokens at a low computational cost\. Empirical results show that our method significantly improves model parallelism and decreases inference latency while largely preserving accuracy\. Crucially, LEAP alleviates the strict reliance of dLLM parallel decoding on high\-confidence criteria, opening new avenues for parallel inference research\.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) GPT\-4 technical report\. arXiv preprint arXiv:2303\.08774\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, et al\. \(2021\) Program synthesis with large language models\. arXiv preprint arXiv:2108\.07732\. Cited by: [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p1.1)\.
- H\. Ben\-Hamu, I\. Gat, D\. Severo, N\. Nolte, and B\. Karrer \(2025\) Accelerated sampling from masked diffusion models via entropy bounded unmasking\. arXiv preprint arXiv:2505\.24857\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- T\. Bie, M\. Cao, K\. Chen, L\. Du, M\. Gong, Z\. Gong, Y\. Gu, J\. Hu, Z\. Huang, Z\. Lan, et al\. \(2025\) LLaDA2\.0: scaling up diffusion language models to 100B\. arXiv preprint arXiv:2512\.15745\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1), [§2](https://arxiv.org/html/2605.10980#S2.p1.1)\.
- M\. Chen \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\. Cited by: [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p1.1)\.
- Z\. Chen, G\. Fang, X\. Ma, R\. Yu, and X\. Wang \(2025\) dParallel: learnable parallel decoding for dLLMs\. arXiv preprint arXiv:2509\.26488\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, et al\. \(2024\) The Llama 3 herd of models\. arXiv e\-prints, pp\. arXiv–2407\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p1.1)\.
- H\. Fu, B\. Huang, V\. Adams, C\. Wang, V\. Srinivasan, and J\. Jiao \(2025\) From bits to rounds: parallel decoding with exploration for diffusion language models\. arXiv preprint arXiv:2511\.21103\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p4.1)\.
- S\. H\. Kim, S\. Hong, H\. Jung, Y\. Park, and S\. Yun \(2025\) KLASS: KL\-guided fast inference in masked diffusion models\. arXiv preprint arXiv:2511\.05664\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1), [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p2.1)\.
- A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, T\. Gutman\-Solo, et al\. \(2022\) Solving quantitative reasoning problems with language models\. Advances in Neural Information Processing Systems 35, pp\. 3843–3857\. Cited by: [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p1.1)\.
- J\. Li, J\. Guan, W\. Wu, and C\. Li \(2025a\) ReFusion: a diffusion large language model with parallel autoregressive decoding\. arXiv preprint arXiv:2512\.13586\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- P\. Li, Y\. Zhou, D\. Muhtar, L\. Yin, S\. Yan, L\. Shen, Y\. Liang, S\. Vosoughi, and S\. Liu \(2025b\) Diffusion language models know the answer before decoding\. arXiv preprint arXiv:2508\.19982\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- A\. Liu, M\. He, S\. Zeng, S\. Zhang, L\. Zhang, C\. Wu, W\. Jia, Y\. Liu, X\. Zhou, and J\. Zhou \(2025a\) WeDLM: reconciling diffusion language models with standard causal attention for fast inference\. arXiv preprint arXiv:2512\.22737\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1), [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, et al\. \(2024\) DeepSeek\-V3 technical report\. arXiv preprint arXiv:2412\.19437\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1)\.
- J\. Liu, X\. Dong, Z\. Ye, R\. Mehta, Y\. Fu, V\. Singh, J\. Kautz, C\. Zhang, and P\. Molchanov \(2025b\) TiDAR: think in diffusion, talk in autoregression\. arXiv preprint arXiv:2511\.08923\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1)\.
- X\. Ma, R\. Yu, G\. Fang, and X\. Wang \(2025\) dKV\-Cache: the cache for diffusion language models\. arXiv preprint arXiv:2505\.15781\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\) Large language diffusion models\. arXiv preprint arXiv:2502\.09992\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1), [§2](https://arxiv.org/html/2605.10980#S2.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level Google\-proof Q&A benchmark\. In First Conference on Language Modeling\. Cited by: [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p1.1)\.
- Y\. Song, Z\. Zhang, C\. Luo, P\. Gao, F\. Xia, H\. Luo, Z\. Li, Y\. Yang, H\. Yu, X\. Qu, et al\. \(2025\) Seed diffusion: a large\-scale diffusion language model with high\-speed inference\. arXiv preprint arXiv:2508\.02193\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p1.1)\.
- X\. Wang, C\. Xu, Y\. Jin, J\. Jin, H\. Zhang, and Z\. Deng \(2025\) Diffusion LLMs can do faster\-than\-AR inference via discrete diffusion forcing\. arXiv preprint arXiv:2508\.09192\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1), [§2](https://arxiv.org/html/2605.10980#S2.p2.1)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025\) Fast\-dLLM: training\-free acceleration of diffusion LLM by enabling KV cache and parallel decoding\. arXiv preprint arXiv:2505\.22618\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1), [§1](https://arxiv.org/html/2605.10980#S1.p2.1), [§1](https://arxiv.org/html/2605.10980#S1.p3.1), [§2](https://arxiv.org/html/2605.10980#S2.p2.1), [§3\.1\.2](https://arxiv.org/html/2605.10980#S3.SS1.SSS2.p1.4), [§3\.2](https://arxiv.org/html/2605.10980#S3.SS2.p1.1), [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p2.1)\.
- C\. Xu, Y\. Jin, J\. Li, Y\. Tu, G\. Long, D\. Tu, M\. Song, H\. Si, T\. Hou, J\. Yan, et al\. \(2025\) LoPA: scaling dLLM inference via lookahead parallel decoding\. arXiv preprint arXiv:2512\.16229\. Cited by: [§2](https://arxiv.org/html/2605.10980#S2.p2.1), [§4\.1](https://arxiv.org/html/2605.10980#S4.SS1.p2.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025\) Dream 7B: diffusion large language models\. arXiv preprint arXiv:2508\.15487\. Cited by: [§1](https://arxiv.org/html/2605.10980#S1.p1.1)\.
