Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
Summary
This paper proposes Dynamic-dLLM, a training-free framework that accelerates diffusion large language models by dynamically allocating cache-update budgets and calibrating decoding thresholds, achieving over 3x speedup on models like LLaDA and Dream while maintaining performance.
# Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
Source: [https://arxiv.org/html/2606.26120](https://arxiv.org/html/2606.26120)
Tianyi Wu\*¹, Xiaoxi Sun\*¹, Yanhua Jiao¹, Yulin Li¹, Yixin Chen², Yunhao Cao², Yiqi Hu², Zhuotao Tian‡¹,³

¹Harbin Institute of Technology, Shenzhen ²Huawei ³Shenzhen Loop Area Institute

\*Equal contribution. The work was done during Tianyi’s internship at CBG Celia DeviceAI Team, Huawei. ‡Corresponding author (tianzhuotao@hit.edu.cn).
###### Abstract
Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity, scaling as $\mathcal{O}(L^{3})$ with sequence length $L$, poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed, attaining an average speedup exceeding 3× while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment. The code is available at https://github.com/TianyiWu233/DYNAMIC-DLLM.
Figure 1: Comparison in terms of tokens per second (TPS) on (a) LLaDA-8B-Instruct and (b) Dream-v0-7B-Instruct.
## 1 Introduction
Diffusion Large Language Models (dLLMs) have emerged as a compelling alternative to autoregressive models (ARMs), demonstrating strong performance in text generation tasks. Notable examples such as LLaDA (Nie et al., [2025](https://arxiv.org/html/2606.26120#bib.bib4); Zhu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib5)) and Dream (Ye et al., [2025](https://arxiv.org/html/2606.26120#bib.bib10)) highlight the rapid progress in this direction. A key advantage of dLLMs lies in their bidirectional attention mechanisms, which enhance scalability and enable superior performance in handling complex scenarios, such as the “reversal curse” ([Berglund et al.](https://arxiv.org/html/2606.26120#bib.bib11)), where traditional ARMs often struggle. This allows dLLMs to capture richer contextual dependencies in challenging scenarios.
However, despite their strong performance in certain domains, dLLMs face a fundamental challenge: their computational complexity scales as $\mathcal{O}(L^{3})$ with respect to sequence length $L$, significantly exceeding the $\mathcal{O}(L^{2})$ cost of ARMs. This cubic scaling imposes a severe bottleneck for long-sequence and real-time generation tasks, limiting the practical deployability of dLLMs in latency-sensitive applications. The root cause lies in the non-autoregressive nature of dLLMs, where each denoising step requires updating all tokens in parallel across the full sequence. In addition, this paradigm hinders the caching of key-value activations from previous steps, rendering dLLMs incompatible with the widely used KV-Cache mechanism.
#### Key observations.
To address this issue, recent work has explored strategies for dLLM acceleration. For example, several methods (Liu et al., [2025b](https://arxiv.org/html/2606.26120#bib.bib6); Ma et al., [2025](https://arxiv.org/html/2606.26120#bib.bib8); Song et al., [2025](https://arxiv.org/html/2606.26120#bib.bib9)) reduce redundancy by caching internal token representations across decoding steps. Concurrently, Wu et al. ([2025](https://arxiv.org/html/2606.26120#bib.bib7)) accelerate inference by enabling parallel unmasking of multiple tokens within a single step. These methods implicitly rely on specific token properties, such as feature stability and confidence, to identify opportunities for optimization. However, they all rely on a static strategy across all layers and decoding steps, applying the same caching or unmasking criteria throughout the model and generation process, thus overlooking the dynamic nature of token behavior during generation.
As illustrated in Figure [2](https://arxiv.org/html/2606.26120#S1.F2)(a-d), the token properties vary across different layers and steps. The frequency of changes in the internal features of tokens differs across layers, while the distributions of token confidence fluctuate across decoding steps. The static strategies adopted by existing methods may fail to account for this dynamic behavior, leading to performance degradation. This observation prompts a critical question: how can we design an adaptive method that dynamically aligns with the model’s intrinsic layer-wise and step-wise token dynamics to improve efficiency?
Figure 2: (a-b) Layer input similarity and attention output similarity across adjacent denoising steps. Brighter regions denote higher similarity, indicating that most tokens are stable across steps. (c-d) The number of tokens requiring updates across different steps, measured on the layer input and the attention output, respectively. Differences across layers indicate varying demands for the token update budget. (e) Wrong prediction using a fixed threshold: existing parallel decoding methods may yield wrong predictions because potential candidates have been discarded by the fixed threshold.
#### Our solution.
In this work, we propose Dynamic-dLLM, a training-free framework for accelerating dLLM inference. Dynamic-dLLM consists of two key components: Dynamic Cache Updating (DCU) and Adaptive Parallel Decoding (APD).
Specifically, as tokens may exhibit heterogeneous dynamics across layers, instead of a static cache updating strategy across all layers, we propose Dynamic Cache Updating (DCU), which allocates cache-update budgets adaptively, ensuring that layers requiring frequent updates are prioritized, while computational overhead is reduced in stable layers. In addition, the existing parallel decoding strategy with fixed thresholds risks committing to tokens prematurely, as confidence estimates can shift over time, leading to error propagation. To mitigate this, we introduce Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds by tracking the evolving distribution of prediction confidence, balancing the degradation of generation quality caused by a low threshold against the inefficiency resulting from a high threshold.
Extensive experiments across LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct, on benchmarks covering mathematics, science, coding, and general tasks, demonstrate the effectiveness and strong generalization capabilities of the proposed method. Notably, Dynamic-dLLM achieves a maximum speedup of 4.48×, with an average speedup exceeding 3×, while still maintaining performance, making it a plug-and-play training-free solution for enhancing the efficiency of dLLMs. In summary, our contributions are as follows:
- We observe that the variations across layers and decoding steps of dLLMs may undermine the effectiveness of existing static rule-based acceleration methods.
- We propose Dynamic-dLLM, a training-free framework composed of Dynamic Cache Updating (DCU) and Adaptive Parallel Decoding (APD). DCU adaptively allocates cache-update budgets across layers, while APD dynamically calibrates decoding thresholds across steps, jointly enabling efficient yet robust acceleration of dLLMs.
- Extensive experiments across diverse models and tasks show that Dynamic-dLLM substantially improves inference efficiency while preserving accuracy, outperforming state-of-the-art acceleration methods.
## 2 Background and Motivation
### 2.1 Preliminaries of dLLM
In this section, we introduce preliminaries regarding the inference process of dLLM (Nie et al., [2025](https://arxiv.org/html/2606.26120#bib.bib4)). Related work is presented in Appendix [D](https://arxiv.org/html/2606.26120#A4) due to the page limit.
Given a prompt of length $L_{\text{prompt}}$ tokens and a target generation length of $L_{\text{gen}}$ tokens, let $L = L_{\text{prompt}} + L_{\text{gen}}$. The dLLM generates the output in $T$ iterative decoding steps, producing approximately $L_{\text{gen}}/T$ tokens per step. Let $\mathcal{V}$ denote the model’s vocabulary, and let $[\text{MASK}] \in \mathcal{V}$ be a special placeholder token indicating positions to be predicted. Denote by $\mathbf{x}^{t} \in \mathcal{V}^{L}$ the token sequence at step $t$, where $t = T, T-1, \dots, 0$. The initial sequence is constructed as:

$$\mathbf{x}^{T} = (x_{0}, \dots, x_{L_{\text{prompt}}-1}, [\text{MASK}], \dots, [\text{MASK}]), \tag{1}$$

where $x_{i}$ are the given prompt tokens. At each step $t$, the mask predictor $f_{\theta}$ computes a distribution over the vocabulary for each position:

$$\mathbf{z}^{t} = f_{\theta}(\mathbf{x}^{t}) \in \mathbb{R}^{L \times |\mathcal{V}|}. \tag{2}$$

Using greedy decoding, we can obtain the most probable token at each masked position:

$$\hat{x}_{i}^{t} = \underset{v \in \mathcal{V}}{\arg\max} \left( \mathrm{Softmax}(\mathbf{z}_{i}^{t}) \right)_{v}, \quad \text{if } x_{i}^{t} = [\text{MASK}]. \tag{3}$$

A transition function $S$ then updates the sequence to $\mathbf{x}^{t-1}$ by selectively replacing tokens based on confidence scores, re-masking low-confidence predictions to refine them in subsequent steps: $\mathbf{x}^{t-1} = S(\hat{\mathbf{x}}^{t}, \mathbf{x}^{t}, t)$. The final output sequence $\mathbf{x}^{0}$ is obtained when $t = 0$.
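The decoding loop of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_mask_predictor` is a random stand-in for $f_{\theta}$, `MASK = -1` is a hypothetical sentinel rather than a real vocabulary id, and the transition $S$ is approximated by committing the most confident masked predictions at each step and leaving the rest masked.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for [MASK]; real models use a vocabulary id


def toy_mask_predictor(x, vocab_size, rng):
    """Random stand-in for f_theta: returns logits of shape (L, |V|)."""
    return rng.standard_normal((len(x), vocab_size))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def decode(prompt, gen_len, steps, vocab_size, seed=0):
    rng = np.random.default_rng(seed)
    # Eq. (1): prompt tokens followed by gen_len masked positions
    x = np.concatenate([np.asarray(prompt), np.full(gen_len, MASK)])
    per_step = int(np.ceil(gen_len / steps))  # roughly L_gen / T tokens per step
    for _ in range(steps):
        masked = np.flatnonzero(x == MASK)
        if masked.size == 0:
            break
        probs = softmax(toy_mask_predictor(x, vocab_size, rng))  # Eq. (2)
        pred = probs.argmax(axis=-1)                             # Eq. (3)
        conf = probs.max(axis=-1)
        # Assumed transition S: commit the most confident masked predictions,
        # keep every other masked position masked for later steps.
        keep = masked[np.argsort(-conf[masked])[:per_step]]
        x[keep] = pred[keep]
    return x


out = decode(prompt=[5, 7, 9], gen_len=8, steps=4, vocab_size=11)
print(out)
```

Because the predictor is re-run on the full sequence at every step, the cost of each step grows with $L$, which is the redundancy the caching methods discussed later try to remove.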
### 2.2 Key Observations
Despite recent progress in accelerating diffusion-style LLMs (dLLMs) (Liu et al., [2025b](https://arxiv.org/html/2606.26120#bib.bib6); Wu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib7); Ma et al., [2025](https://arxiv.org/html/2606.26120#bib.bib8); Song et al., [2025](https://arxiv.org/html/2606.26120#bib.bib9)), two critical inefficiencies remain unaddressed.
#### Layer-wise Cache Update Needs Vary Significantly.
Existing methods exploit temporal redundancy by reusing cached intermediate features (e.g., query, key, value, attention output, FFN output) from the previous step for a subset of tokens, assuming high feature similarity across steps. However, as illustrated in Figure [2](https://arxiv.org/html/2606.26120#S1.F2)(a-d), the proportion of tokens requiring cache updates varies substantially across layers, increasing monotonically from shallow to deep layers. This suggests that uniform or heuristic caching strategies are suboptimal. Instead, a layer-adaptive cache update policy is essential for dynamically allocating computation budgets where they matter most.
#### Static Thresholding Hinders Parallel Decoding Efficacy.
Parallel decoding strategies (e.g., Wu et al. ([2025](https://arxiv.org/html/2606.26120#bib.bib7))) unmask tokens once their confidence exceeds a fixed threshold. Yet, as shown in Figure [2](https://arxiv.org/html/2606.26120#S1.F2)(e), the token with the highest confidence at an early step may not be the desired output and will be revised later, often replaced by its “runner-up” prediction, which initially had the second-highest confidence. Conversely, tokens whose top prediction exhibits clear dominance over alternatives, i.e., low entropy or a large margin, can be safely finalized earlier, even if absolute confidence remains below a static threshold. Therefore, to commit earlier to stable predictions and expedite convergence without compromising accuracy, it becomes essential to explore a dynamic per-token threshold that adapts to the predicted distribution (e.g., its entropy or probability margin).
## 3 Method
Figure 3: Dynamic-dLLM consists of two key components: Dynamic Cache Updating (DCU, upper part) and Adaptive Parallel Decoding (APD, lower part). DCU reallocates the cache update budget for each layer at each step, while APD dynamically adjusts the decoding thresholds for all tokens.

To overcome the limitations of existing approaches, we propose Dynamic-dLLM, a training-free acceleration framework that dynamically optimizes dLLM inference along two dimensions: cache-update management and parallel decoding scheduling.
Regarding cache-update management, we introduce a dynamic allocation mechanism that recognizes the varying dynamics across layers. This approach distributes the update budget among layers, prioritizing layers that require more frequent cache updates. For parallel decoding, we replace fixed confidence thresholds with an adaptive per-token unmasking strategy based on the predicted distribution of each token. This strategy facilitates early commitment to confident predictions while postponing uncertain ones, achieving a more balanced trade-off between speed and output quality.
The overview is presented in Figure [3](https://arxiv.org/html/2606.26120#S3.F3). Sections [3.1](https://arxiv.org/html/2606.26120#S3.SS1) and [3.2](https://arxiv.org/html/2606.26120#S3.SS2) detail each component, respectively.
### 3.1 Dynamic Cache Updating
Recent works (Liu et al., [2025b](https://arxiv.org/html/2606.26120#bib.bib6); Ma et al., [2025](https://arxiv.org/html/2606.26120#bib.bib8); Song et al., [2025](https://arxiv.org/html/2606.26120#bib.bib9)) update a fixed or uniform number of token caches across all layers. However, as demonstrated in Section [2.2](https://arxiv.org/html/2606.26120#S2.SS2), the demand for cache updates varies significantly across layers. This observation motivates a dynamic allocation strategy that adapts the cache-update budget per layer according to its specific requirement.
In this section, we propose the Dynamic Cache Updating (DCU) strategy, which selectively updates only those tokens whose representations undergo significant changes between consecutive inference steps. Prior work (Liu et al., [2025b](https://arxiv.org/html/2606.26120#bib.bib6)) identifies such tokens by measuring the cosine similarity between the current and cached Value vectors. While effective, this approach incurs non-negligible computational overhead due to the explicit recomputation and comparison of Value vectors. Ideally, if token dynamics could be estimated without recomputing these vectors, cached values could be safely reused, thereby reducing redundancy.
Inspired by Liu et al. ([2025a](https://arxiv.org/html/2606.26120#bib.bib12)), who observed a strong correlation between model inputs and outputs in diffusion transformers (DiT) (Peebles and Xie, [2023a](https://arxiv.org/html/2606.26120#bib.bib13)), we investigate the relationship between layer inputs and intermediate features in dLLMs. As shown in Figure [4](https://arxiv.org/html/2606.26120#S3.F4), the cached features (e.g., Key, Value, Attention Output, and FFN Output) exhibit high correlation with the corresponding layer inputs. This implies that changes in layer inputs across steps serve as a reliable proxy for the underlying dynamics of intermediate activations. Consequently, input-level differences can effectively inform cache-update decisions without accessing or recomputing the cached features themselves.
Figure 4: Spearman correlation values of layer inputs with intermediate features, including Key, Value, Attention Output, and FFN Output. We visualize the cosine similarity between tokens’ feature vectors and their cached counterparts at adjacent steps, and compare the relationship between the layer input and (a) Attention Output (0.94), (b) FFN Output (0.97), (c) Key (0.99), and (d) Value (0.99).

#### Layer-Adaptive Cache Budget Allocation.
To dynamically allocate the cache update budget across layers, we first define a token-level dissimilarity metric, $d_{i}^{t,l}$, estimating the change in the representation of token $x_{i}$ at layer $l$ between consecutive inference steps $t$ and $t+1$. This metric is computed using the cosine distance between the normalized token inputs at the respective steps:

$$d_{i}^{t,l} = 1 - \frac{(\mathbf{x}_{i}^{t,l})^{\top} \mathbf{x}_{i}^{t+1,l}}{\|\mathbf{x}_{i}^{t,l}\| \, \|\mathbf{x}_{i}^{t+1,l}\|}. \tag{4}$$

A higher value of $d_{i}^{t,l}$ denotes a greater change in the token’s representation, suggesting a higher need for a cache update. We then aggregate the token-level variations into a layer-wise metric $s^{t,l}$, which represents the average change in token representations within layer $l$:

$$s^{t,l} = \frac{1}{N} \sum_{i=0}^{N-1} d_{i}^{t,l}, \tag{5}$$

where $N$ is the sequence length. The cache update budget for layer $l$ at step $t$, denoted as $B^{t,l}_{\text{layer}}$, is then allocated proportionally to its measured dynamism at the previous step ($t+1$), $s^{t+1,l}$. This allocation is normalized across all layers using the total available budget, $B_{\text{layer}} \times \texttt{LayerNum}$:

$$B^{t,l}_{\text{layer}} = (B_{\text{layer}} \times \texttt{LayerNum}) \cdot \frac{s^{t+1,l}}{\sum_{k=0}^{\texttt{LayerNum}-1} s^{t+1,k}}. \tag{6}$$

For each layer $l$, the set of tokens scheduled for cache update at step $t$, denoted $\mathcal{U}^{t,l}$, is initialized as an empty set at the start of the step: $\mathcal{U}^{t,l} \leftarrow \emptyset$. Then, layer $l$ identifies the set $\mathcal{S}^{t,l}$ comprising the top-$B^{t,l}_{\text{layer}}$ tokens with the highest variation $d^{t,l}_{i}$. These selected tokens are then added to the update set: $\mathcal{U}^{t,l} \leftarrow \mathcal{U}^{t,l} \cup \mathcal{S}^{t,l}$.
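The allocation in Eqs. (4)-(6) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the `eps` term and the uniform fallback for all-zero scores are added assumptions, and rounding the fractional budget to an integer top-k is one possible choice that the text does not specify.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Eq. (4): row-wise cosine distance between two (N, D) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return 1.0 - num / den


def layer_score(d):
    """Eq. (5): average token-level dissimilarity within one layer."""
    return float(np.mean(d))


def allocate_layer_budgets(prev_scores, base_budget):
    """Eq. (6): split base_budget * LayerNum in proportion to s^{t+1,l}."""
    prev_scores = np.asarray(prev_scores, dtype=float)
    total = base_budget * len(prev_scores)
    if prev_scores.sum() == 0:  # assumption: fall back to a uniform split
        return np.full(len(prev_scores), float(base_budget))
    return total * prev_scores / prev_scores.sum()


def select_tokens(d, budget):
    """Indices of the top-`budget` tokens by dissimilarity (rounding is an assumption)."""
    k = int(min(len(d), max(0, round(float(budget)))))
    return np.argsort(-d)[:k]


# Two layers whose previous-step scores are 1.0 and 3.0, base budget 4 per layer.
budgets = allocate_layer_budgets([1.0, 3.0], base_budget=4)
print(budgets)  # the more dynamic layer receives three quarters of the total
```

The total budget is conserved by construction, so a larger share for dynamic layers is paid for by stable layers rather than by extra computation.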
#### Token Stuck in the Mud.
Nevertheless, the layer-adaptive cache budget allocation strategy may leave some tokens stuck in the mud. Specifically, if a token $x_{i}$ is not selected for an update in layer $l$, its cached representation remains unchanged. Consequently, its input to layer $l+1$ also remains static, leading to a zero variation score $d_{i}^{t,l+1} = 0$ for that layer. As the allocation strategy prioritizes tokens with high $d_{i}^{t,l+1}$, the token $x_{i}$ will only be updated in layer $l+1$ if the number of tokens exhibiting non-zero variation is insufficient to fill the allocated budget $B_{\text{layer}}^{t,l+1}$. Should this occur, and if $x_{i}$ is again not selected (e.g., when the remaining slots are filled randomly among the low-priority tokens), it will remain unchanged entering layer $l+2$, perpetuating the cycle. We refer to this phenomenon, where a token fails to be updated across multiple consecutive layers due to consistently low variation scores induced by prior missed updates, as a token becoming stuck in the mud.
Figure 5: Local property analysis of dLLMs. (a) Analysis of distance: relationship between the distance from the key token and the frequency of being decoded in the current step. The closer a token is to the key token, the higher the probability of it being decoded. (b-c) Attention of response tokens to the key token in layers 15 and 24, respectively. The red dot marks the key token at this step. Tokens around the key token have higher attention, which means that the changes caused by decoding the key token affect those tokens more than others.
#### Mandatory Update Window.
As illustrated in Figure [5](https://arxiv.org/html/2606.26120#S3.F5), there exists a spatial locality in the update pattern: tokens surrounding the one unmasked in the previous step (the key token) are statistically more likely to be updated in the current step. Let the position of the key token be $p$. To mitigate the risk of the next key token (the token with the highest confidence to be unmasked in the current step) becoming stuck in the mud, we introduce a Mandatory Update Window. This mechanism ensures that a local region around the key token is always updated, regardless of the adaptive budget allocation. Formally, we define a window of fixed size $B_{\text{window}}$ centered on the key token’s position $p$. The set of token positions covered by this window at a given step is $\left[p - \frac{B_{\text{window}}}{2},\, p + \frac{B_{\text{window}}}{2}\right]$. For each layer $l$, the caches for all tokens within this window are compulsorily added to the layer’s update set $\mathcal{U}^{t,l}$:

$$\mathcal{U}^{t,l} \leftarrow \mathcal{U}^{t,l} \cup \left\{ x_{i} \,\middle|\, p - \frac{B_{\text{window}}}{2} \leq i \leq p + \frac{B_{\text{window}}}{2} \right\}. \tag{7}$$

This updated set $\mathcal{U}^{t,l}$ then constitutes the final list of tokens whose caches will be recomputed for layer $l$ in the current step. By ensuring continuous updates within this local window, we reduce the likelihood of critical tokens being overlooked and retain the response to local changes. The global budget is subsequently distributed adaptively among the remaining tokens based on the layer-specific variation metrics $s^{t,l}$ for the following step.
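As a rough illustration of Eq. (7), the final update set is the union of the budget-selected tokens and the window around the key token. The helper names below are hypothetical, and two details are assumptions not stated in the text: the window is clipped to valid sequence positions, and an odd $B_{\text{window}}$ is handled with integer division.

```python
import numpy as np


def mandatory_window(p, window, seq_len):
    """Positions in [p - window/2, p + window/2], clipped to [0, seq_len - 1]."""
    half = window // 2  # assumption: integer half-width
    lo = max(0, p - half)
    hi = min(seq_len - 1, p + half)
    return np.arange(lo, hi + 1)


def final_update_set(selected, p, window, seq_len):
    """Eq. (7): union of budget-selected tokens and the mandatory window."""
    return np.union1d(selected, mandatory_window(p, window, seq_len))


# Tokens 9 and 4 were picked by the adaptive budget; the key token sits at p = 5.
print(final_update_set(np.array([9, 4]), p=5, window=4, seq_len=10))
```

Positions that appear in both sets are counted once, so the window only adds cost for tokens that the adaptive budget would otherwise have skipped.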
### 3.2 Adaptive Parallel Decoding
Section [2.2](https://arxiv.org/html/2606.26120#S2.SS2) highlights that the peak confidence of a token can vary significantly across decoding steps in dLLMs. This inherent dynamism poses a challenge for fixed-threshold parallel decoding methods (Wu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib7)), which rely on a static criterion and consequently suffer from decoding inaccuracies due to mispredictions at certain steps.
To address this, we introduce the Adaptive Parallel Decoding mechanism, which dynamically adjusts the decoding threshold for each token based on its local prediction stability. Each token $x_{i}$ starts with an initial threshold $\tau^{T}_{i}$. The threshold at step $t$, denoted $\tau^{t}_{i}$, is adapted from the threshold used at the previous step $t+1$, $\tau^{t+1}_{i}$.
#### Adaptive Threshold via Confidence Concentration.
The core idea is to modulate the threshold based on the concentration of the token’s predicted probability distribution. Intuitively, a diffuse distribution (a small gap between the highest and second-highest probabilities) suggests lower confidence in the current prediction, warranting a stricter (higher) threshold to reduce unnecessary updates. Conversely, a concentrated distribution indicates stability, permitting a reduced threshold for early decoding. Let $\mathbf{z}_{i}^{t}$ be the probability distribution over the vocabulary $\mathcal{V}$ for token $x_{i}$ at step $t$. The index of the most likely token is $u = \arg\max_{v \in \mathcal{V}} \left(\mathbf{z}_{i}^{t}\right)_{v}$. The concentration of this distribution is then quantified using the second-highest probability score:

$$c_{i}^{t} = 1 - \max_{v \in \mathcal{V} \setminus \{u\}} \left(\mathbf{z}_{i}^{t}\right)_{v}. \tag{8}$$

A larger value of $c_{i}^{t}$ signifies a more peaked and confident distribution. Based on this measure, the decoding threshold for token $x_{i}$ at step $t$ is adjusted as follows:

$$\tau^{t}_{i} = \tau^{t+1}_{i} - \alpha \cdot c_{i}^{t}, \tag{9}$$

where $\alpha$ is a positive hyperparameter controlling the sensitivity of the threshold adaptation. Under this formulation, tokens with highly concentrated distributions (large $c_{i}^{t}$) have their thresholds decreased the most, allowing for early decoding, while tokens with diffuse distributions (small $c_{i}^{t}$) receive a smaller reduction and therefore retain comparatively higher thresholds, which helps prevent decoding errors.
#### Integration with Temporal Instability.
In addition, the magnitude of historical shifts in a token’s confidence distribution provides a strong signal for its likelihood of future revision. We quantify this shift via the cosine distance between the token’s confidence distributions at adjacent steps:

$$H_{i}^{t} = 1 - \frac{(\mathbf{z}_{i}^{t})^{\top} \mathbf{z}_{i}^{t+1}}{\|\mathbf{z}_{i}^{t}\| \, \|\mathbf{z}_{i}^{t+1}\|}. \tag{10}$$

A larger $H_{i}^{t}$ indicates greater instability in the prediction, suggesting that the token may still be undergoing refinement and thus warrants a stricter (higher) threshold to prevent early decoding. By combining $c_{i}^{t}$ and $H_{i}^{t}$, the decoding threshold for token $x_{i}$ at step $t$ is updated as:

$$\tau^{t}_{i} = \tau^{t+1}_{i} - \alpha \cdot c_{i}^{t} + \beta \cdot H_{i}^{t}, \tag{11}$$

where $\alpha, \beta \geq 0$ are hyperparameters balancing the influence of prediction confidence and temporal instability.
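The threshold update of Eqs. (8)-(11) is compact enough to sketch directly. The code below is an illustrative reading of those equations, not the authors' implementation: inputs are assumed to be per-token probability distributions (already normalized), the function names and `eps` term are hypothetical, and the values of `alpha` and `beta` in the usage lines are arbitrary placeholders rather than settings from the paper.

```python
import numpy as np


def concentration(p):
    """Eq. (8): one minus the second-highest probability; p has shape (N, |V|)."""
    second = np.partition(p, -2, axis=-1)[..., -2]
    return 1.0 - second


def instability(p_cur, p_prev, eps=1e-8):
    """Eq. (10): cosine distance between distributions at adjacent steps."""
    num = (p_cur * p_prev).sum(axis=-1)
    den = np.linalg.norm(p_cur, axis=-1) * np.linalg.norm(p_prev, axis=-1) + eps
    return 1.0 - num / den


def update_threshold(tau_prev, p_cur, p_prev, alpha, beta):
    """Eq. (11): tau^t = tau^{t+1} - alpha * c^t + beta * H^t."""
    return tau_prev - alpha * concentration(p_cur) + beta * instability(p_cur, p_prev)


# Token 0 is peaked and stable; token 1 is diffuse and its top prediction flipped.
p_prev = np.array([[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]])
p_cur = np.array([[0.90, 0.05, 0.05], [0.35, 0.40, 0.25]])
tau = update_threshold(np.array([0.9, 0.9]), p_cur, p_prev, alpha=0.1, beta=0.5)
print(tau)  # token 0 ends with the lower threshold
```

A token would then be unmasked once its top probability reaches its own threshold, for example `p_cur.max(axis=-1) >= tau`; this comparison follows the fixed-threshold rule described in Section 2.2 and is stated here as an assumption about how the adapted threshold is applied.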
Algorithms [1](https://arxiv.org/html/2606.26120#alg1) and [2](https://arxiv.org/html/2606.26120#alg2) in the Appendix outline the core mechanisms of Dynamic-dLLM for accelerating diffusion LLMs (dLLMs) via feature caching and parallel decoding, respectively. By explicitly accounting for dynamism along both the layer and step dimensions, Dynamic-dLLM minimizes redundant computation and thereby significantly accelerates the inference process of dLLMs.
Table 1: Results on LLaDA-8B-Instruct (Nie et al., [2025](https://arxiv.org/html/2606.26120#bib.bib4)). Each cell reports the accuracy and the decoding throughput (TPS), together with the speedup relative to the baseline. Best values are in bold and second-best values are underlined. Results marked with \* are obtained with parallel decoding.
## 4 Experiment
### 4.1 Experiment Settings
We assess the performance of Dynamic-dLLM using three typical dLLMs as baselines: LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct. Unless otherwise specified, we set $B_{\text{layer}}$ to 32 and $B_{\text{window}}$ to 32. More experimental details are given in Appendix [B](https://arxiv.org/html/2606.26120#A2).
To comprehensively evaluate a model’s performance and efficiency, we employ two key metrics: accuracy on benchmarks and throughput, with the latter measured in tokens per second (TPS). The benchmarks include MMLU (5-shot) (Hendrycks et al., [2020](https://arxiv.org/html/2606.26120#bib.bib15)), ARC-challenge (ARC-c, 0-shot) (Clark et al., [2018](https://arxiv.org/html/2606.26120#bib.bib16)), GPQA (5-shot) (Rein et al., [2024](https://arxiv.org/html/2606.26120#bib.bib17)), GSM8k (4-shot) (Cobbe et al., [2021](https://arxiv.org/html/2606.26120#bib.bib14)), and HumanEval (HE, 0-shot) (Chen et al., [2021](https://arxiv.org/html/2606.26120#bib.bib18)). For a fair comparison, we divide the methods into two groups, one using Feature-Cache (Liu et al., [2025b](https://arxiv.org/html/2606.26120#bib.bib6); Ma et al., [2025](https://arxiv.org/html/2606.26120#bib.bib8); Wu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib7)) and the other using KV-Cache and parallel decoding (Wu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib7)). All experiments were performed on NVIDIA Pro6000 GPUs.
Table 2: Results on LLaDA-1.5 (Zhu et al., [2025](https://arxiv.org/html/2606.26120#bib.bib5)). Each cell reports the accuracy and the decoding throughput (TPS), together with the speedup relative to the baseline. Best values are in bold and second-best values are underlined. Results marked with \* are obtained with parallel decoding.

Table 3: Results on Dream-v0-7B-Instruct (Ye et al., [2025](https://arxiv.org/html/2606.26120#bib.bib10)). Each cell reports the accuracy and the decoding throughput (TPS), together with the speedup relative to the baseline. Best values are in bold and second-best values are underlined. Results marked with \* are obtained with parallel decoding.
### 4.2 Main Results
Our results (baseline vs. alternative methods vs. our Dynamic-dLLM) are presented in Tables [1](https://arxiv.org/html/2606.26120#S3.T1), [2](https://arxiv.org/html/2606.26120#S4.T2), and [3](https://arxiv.org/html/2606.26120#S4.T3). These results show that Dynamic-dLLM not only achieves the largest throughput improvement but also maintains performance.
With only the feature cache enabled, Dynamic-dLLM delivers substantial speedups without accuracy degradation, reaching an average speedup of over 2.5× across all evaluated tasks for LLaDA-8B-Instruct. When combined with parallel decoding, the speedups increase further. For LLaDA-8B-Instruct on GSM8k, throughput reaches 37.29 TPS (4.48× faster than the baseline’s 8.32 TPS), with the average speedup across tasks reaching 3.21× and accuracy remaining robust.
This superiority persists across models: LLaDA-1.5 achieves a 4.46× speedup on GSM8k (37.02 vs. 8.30 TPS) with near-baseline accuracy (60.67% vs. 61.08%); Dream-v0-7B-Instruct gains a 3.91× speedup on GSM8k (31.48 vs. 8.05 TPS). These cross-model results demonstrate its generalization capabilities.
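As a quick sanity check, the GSM8k speedup factors quoted above follow directly from the reported throughput pairs. The snippet uses only the TPS figures stated in this section; the dictionary layout and function name are illustrative.

```python
def speedup(tps_method, tps_baseline):
    """Throughput ratio of the accelerated method over the baseline."""
    return tps_method / tps_baseline


# (Dynamic-dLLM TPS, baseline TPS) on GSM8k, as quoted in the text
gsm8k_tps = {
    "LLaDA-8B-Instruct": (37.29, 8.32),
    "LLaDA-1.5": (37.02, 8.30),
    "Dream-v0-7B-Instruct": (31.48, 8.05),
}

for model, (ours, base) in gsm8k_tps.items():
    print(f"{model}: {speedup(ours, base):.2f}x")
```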
Figure 6: Ablation studies on key hyperparameters, investigating their respective effects on the model’s performance (accuracy) and efficiency (throughput): (a) ablation of $B_{\text{layer}}$, (b) ablation of $B_{\text{window}}$, and (c) ablation of the threshold.
### 4.3 Ablation Studies
In this section, we present the ablation studies regarding the core designs of our method.
Impact of $B_{\text{layer}}$ on Accuracy and Throughput. As shown in Figure [6(a)](https://arxiv.org/html/2606.26120#S4.F6.sf1), we fix $B_{\text{window}}$ to 32, disable parallel decoding, and explore the impact of $B_{\text{layer}}$ on accuracy and throughput. As $B_{\text{layer}}$ increases, accuracy trends upward and reaches a plateau around 32, while throughput decreases rapidly. Based on these observations, a value of 32 for $B_{\text{layer}}$ offers a balanced trade-off.
Impact ofBwindowB\_\{\\text\{window\}\}on Accuracy and Throughput\.Similarly, we discussed the impact ofBwindowB\_\{\\text\{window\}\}on accuracy and throughput in Figure[6\(b\)](https://arxiv.org/html/2606.26120#S4.F6.sf2)\.BwindowB\_\{\\text\{window\}\}is fixed to 32 and parallel decoding is disabled\. The impact ofBwindowB\_\{\\text\{window\}\}on accuracy and throughput is roughly the same as that ofBlayerB\_\{\\text\{layer\}\}, but the smallerBwindowB\_\{\\text\{window\}\}has a more severe reduction in accuracy thanBlayerB\_\{\\text\{layer\}\}\. To ensure that the accuracy is basically on par with the baseline, we have chosen 32 as the optimal value forBwindowB\_\{\\text\{window\}\}\.
Dynamic Threshold vs\. Fixed Threshold\. We discussed the difference between fixed threshold and dynamic threshold in Figure[6\(c\)](https://arxiv.org/html/2606.26120#S4.F6.sf3)\. The accuracy of both is the same under all initialization\. However, dynamic thresholds bring fewer inference steps than fixed thresholds in higher initialization\. With the maximum initialization of 0\.9, which does not excessively descend performance, dynamic thresholds can reduce inference steps by approximately 30% compared to the fixed thresholds\.
## 5 Concluding Remarks
#### Summary\.
We present Dynamic\-dLLM, a training\-free framework for accelerating diffusion LLMs by adapting to the dynamic behavior of tokens across layers and decoding steps\. By introducing dynamic cache updating and adaptive parallel decoding, our method significantly reduces redundant computation while preserving generation quality\. Extensive experiments across various models and benchmarks covering mathematics, science, coding, and general tasks demonstrate the effectiveness and strong generalization capabilities of the proposed method\. In summary, Dynamic\-dLLM offers a general, plug\-and\-play solution toward efficient dLLM inference, highlighting the importance of adaptive strategies in non\-autoregressive generation\.
#### Limitation & Future Work\.
While Dynamic\-dLLM demonstrates strong performance across standard language generation benchmarks, its capabilities in multi\-modal understanding and complex reasoning scenarios remain largely unexplored\. In particular, the model’s current design is tailored to unimodal textual inputs, and it is unclear how its core mechanisms generalize to settings involving heterogeneous data modalities \(e\.g\., vision, audio, or structured knowledge\)\. Future work should investigate how these principles can be reformulated or extended to address the unique challenges of cross\-modal alignment, representation fusion, and modality\-specific computational demands\. Such extensions could unlock new avenues for building more flexible and efficient foundation models capable of robust reasoning across diverse input types\.
## Acknowledgement
This work was supported by the Guangdong Basic and Applied Basic Research Foundation \(2025A1515011546\) and by the Shenzhen Science and Technology Program \(JCYJ20240813105901003, KJZD20240903102901003, ZDCY20250901113000001\)\.
## References
- J\. Austin, D\. D\. Johnson, J\. Ho, D\. Tarlow, and R\. van den Berg \(2021\) Structured denoising diffusion models in discrete state\-spaces\. Advances in neural information processing systems 34, pp\. 17981–17993\. Cited by: [Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- L\. Berglund, M\. Tong, M\. Kaufmann, M\. Balesni, A\. Stickland, T\. Korbak, and O\. Evans \(2023\) The reversal curse: LLMs trained on “A is B” fail to learn “B is A”\. arXiv preprint arXiv:2309\.12288\. Cited by: [§1](https://arxiv.org/html/2606.26120#S1.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv preprint arXiv:1803\.05457\.Cited by:[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- J\. Cui, S\. Liu, Z\. Tian, Z\. Zhong, and J\. Jia \(2022\)Reslt: residual learning for long\-tailed recognition\.IEEE transactions on pattern analysis and machine intelligence45\(3\),pp\. 3695–3706\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Cui, Z\. Zhong, Z\. Tian, S\. Liu, B\. Yu, and J\. Jia \(2023\)Generalized parametric contrastive learning\.IEEE Transactions on Pattern Analysis and Machine Intelligence46\(12\),pp\. 7463–7474\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.arXiv preprint arXiv:2009\.03300\.Cited by:[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- J\. Huang, X\. Hu, B\. Han, S\. Shi, Z\. Tian, T\. He, and L\. Jiang \(2025a\)Memory forcing: spatio\-temporal memory for consistent scene generation on minecraft\.arXiv preprint arXiv:2510\.03198\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Huang, X\. Hu, S\. Shi, Z\. Tian, and L\. Jiang \(2025b\)Edit360: 2d image edits to 3d assets from any angle\.InICCV,Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- L\. Jiang, S\. Shi, Z\. Tian, X\. Lai, S\. Liu, C\. Fu, and J\. Jia \(2021\)Guided point contrastive learning for semi\-supervised point cloud semantic segmentation\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 6423–6432\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- X\. Lai, Z\. Tian, Y\. Chen, Y\. Li, Y\. Yuan, S\. Liu, and J\. Jia \(2024a\)Lisa: reasoning segmentation via large language model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 9579–9589\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- X\. Lai, Z\. Tian, Y\. Chen, S\. Yang, X\. Peng, and J\. Jia \(2024b\)Step\-dpo: step\-wise preference optimization for long\-chain reasoning of llms\.arXiv preprint arXiv:2406\.18629\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- X\. Lai, Z\. Tian, L\. Jiang, S\. Liu, H\. Zhao, L\. Wang, and J\. Jia \(2021\)Semi\-supervised semantic segmentation with directional context\-aware consistency\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 1205–1214\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- Y\. Li, Z\. Liu, Z\. Li, X\. Zhang, Z\. Xu, X\. Chen, H\. Shi, S\. Jiang, X\. Wang, J\. Wang,et al\.\(2025\)Perception, reason, think, and plan: a survey on large multimodal reasoning models\.arXiv preprint arXiv:2505\.04921\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- F\. Liu, S\. Zhang, X\. Wang, Y\. Wei, H\. Qiu, Y\. Zhao, Y\. Zhang, Q\. Ye, and F\. Wan \(2025a\)Timestep embedding tells: it’s time to cache for video diffusion model\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 7353–7363\.Cited by:[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p3.1)\.
- Z\. Liu, Y\. Yang, Y\. Zhang, J\. Chen, C\. Zou, Q\. Wei, S\. Wang, and L\. Zhang \(2025b\)Dllm\-cache: accelerating diffusion large language models with adaptive caching\.arXiv preprint arXiv:2506\.06295\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p3.1),[§1](https://arxiv.org/html/2606.26120#S1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26120#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- A\. Lou, C\. Meng, and S\. Ermon \(2023\)Discrete diffusion language modeling by estimating the ratios of the data distribution\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- X\. Luo, Z\. Tian, T\. Zhang, B\. Yu, Y\. Y\. Tang, and J\. Jia \(2023\)Pfenet\+\+: boosting few\-shot semantic segmentation with the noise\-filtered context\-aware prior mask\.IEEE Transactions on Pattern Analysis and Machine Intelligence46\(2\),pp\. 1273–1289\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- X\. Ma, R\. Yu, G\. Fang, and X\. Wang \(2025\)Dkv\-cache: the cache for diffusion language models\.arXiv preprint arXiv:2505\.15781\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p3.1),[§1](https://arxiv.org/html/2606.26120#S1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26120#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.arXiv preprint arXiv:2502\.09992\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1),[§1](https://arxiv.org/html/2606.26120#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.26120#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.26120#S3.T1)\.
- Z\. Ning, Z\. Tian, G\. Lu, and W\. Pei \(2023\)Boosting few\-shot 3d point cloud segmentation via query\-guided enhancement\.InProceedings of the 31st ACM international conference on multimedia,pp\. 1895–1904\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- W\. Peebles and S\. Xie \(2023a\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 4195–4205\.Cited by:[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p3.1)\.
- W\. Peebles and S\. Xie \(2023b\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 4195–4205\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- B\. Peng, Z\. Tian, S\. Liu, M\. Yang, and J\. Jia \(2024a\)Scalable language model with generalized continual learning\.arXiv preprint arXiv:2404\.07470\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- B\. Peng, Z\. Tian, X\. Wu, C\. Wang, S\. Liu, J\. Su, and J\. Jia \(2023\)Hierarchical dense correlation distillation for few\-shot segmentation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 23641–23651\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- B\. Peng, X\. Wu, L\. Jiang, Y\. Chen, H\. Zhao, Z\. Tian, and J\. Jia \(2024b\)Oa\-cnns: omni\-adaptive sparse cnns for 3d semantic segmentation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 21305–21315\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- S\. Peng, S\. Yang, L\. Jiang, and Z\. Tian \(2025\)Mitigating object hallucinations via sentence\-level early intervention\.arXiv preprint arXiv:2507\.12455\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\)Gpqa: a graduate\-level google\-proof q&a benchmark\.InFirst Conference on Language Modeling,Cited by:[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- T\. Shao, Z\. Tian, H\. Zhao, and J\. Su \(2024\)Explore the potential of clip for training\-free open vocabulary semantic segmentation\.InEuropean Conference on Computer Vision,pp\. 139–156\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Sohl\-Dickstein, E\. A\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.JMLR\.org\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1)\.
- Y\. Song, X\. Liu, R\. Li, Z\. Liu, Z\. Huang, Q\. Guo, Z\. He, and X\. Qiu \(2025\)Sparse\-dllm: accelerating diffusion llms with dynamic cache eviction\.arXiv preprint arXiv:2508\.02558\.Cited by:[§1](https://arxiv.org/html/2606.26120#S1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26120#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2606.26120#S3.SS1.p1.1)\.
- Z\. Tian, P\. Chen, X\. Lai, L\. Jiang, S\. Liu, H\. Zhao, B\. Yu, M\. Yang, and J\. Jia \(2022\)Adaptive perspective distillation for semantic segmentation\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(2\),pp\. 1372–1387\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- Z\. Tian, J\. Cui, L\. Jiang, X\. Qi, X\. Lai, Y\. Chen, S\. Liu, and J\. Jia \(2023\)Learning context\-aware classifier for semantic segmentation\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 2438–2446\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- Z\. Tian, H\. Zhao, M\. Shu, Z\. Yang, R\. Li, and J\. Jia \(2020\)Prior guided feature enrichment network for few\-shot segmentation\.IEEE transactions on pattern analysis and machine intelligence44\(2\),pp\. 1050–1065\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- C\. Wang, L\. Jiang, X\. Wu, Z\. Tian, B\. Peng, H\. Zhao, and J\. Jia \(2024\)Groupcontrast: semantic\-aware self\-supervised representation learning for 3d understanding\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 4917–4928\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Wang, B\. Chen, Y\. Li, B\. Kang, Y\. Chen, and Z\. Tian \(2025a\)Declip: decoupled learning for open\-vocabulary dense perception\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 14824–14834\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Wang, K\. Chen, Y\. Li, B\. Chen, H\. Zhao, X\. Qi, and Z\. Tian \(2025b\)Generalized decoupled learning for enhancing open\-vocabulary dense perception\.arXiv preprint arXiv:2508\.11256\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025\)Fast\-dllm: training\-free acceleration of diffusion llm by enabling kv cache and parallel decoding\.arXiv preprint arXiv:2505\.22618\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p3.1),[§1](https://arxiv.org/html/2606.26120#S1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2606.26120#S2.SS2.SSS0.Px2.p1.1),[§2\.2](https://arxiv.org/html/2606.26120#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.26120#S3.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.26120#S4.SS1.p2.1)\.
- S\. Yang, Y\. Chen, Z\. Tian, C\. Wang, J\. Li, B\. Yu, and J\. Jia \(2025\)Visionzip: longer is better but not necessary in vision language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19792–19802\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- S\. Yang, T\. Qu, X\. Lai, Z\. Tian, B\. Peng, S\. Liu, and J\. Jia \(2023\)Lisa\+\+: an improved baseline for reasoning segmentation with large language model\.arXiv preprint arXiv:2312\.17240\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- S\. Yang, Z\. Tian, L\. Jiang, and J\. Jia \(2024\)Unified language\-driven zero\-shot domain adaptation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 23407–23415\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025\)Dream 7b: diffusion large language models\.arXiv preprint arXiv:2508\.15487\.Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p2.1),[§1](https://arxiv.org/html/2606.26120#S1.p1.1),[Table 3](https://arxiv.org/html/2606.26120#S4.T3)\.
- Y\. Zhang, X\. Wu, Y\. Lao, C\. Wang, Z\. Tian, N\. Wang, and H\. Zhao \(2025\)Concerto: joint 2d\-3d self\-supervised learning emerges spatial representations\.InNeurIPS,Cited by:[Appendix D](https://arxiv.org/html/2606.26120#A4.p1.1)\.
- F\. Zhu, R\. Wang, S\. Nie, X\. Zhang, C\. Wu, J\. Hu, J\. Zhou, J\. Chen, Y\. Lin, J\. Wen,et al\.\(2025\)LLaDA 1\.5: variance\-reduced preference optimization for large language diffusion models\.arXiv preprint arXiv:2505\.19223\.Cited by:[§1](https://arxiv.org/html/2606.26120#S1.p1.1),[Table 2](https://arxiv.org/html/2606.26120#S4.T2)\.
###### Contents
1. [References](https://arxiv.org/html/2606.26120#bib)
2. [A Algorithm Supplement](https://arxiv.org/html/2606.26120#A1)
3. [B Experiment Details](https://arxiv.org/html/2606.26120#A2)
4. [C Example Description](https://arxiv.org/html/2606.26120#A3)
5. [D Related Work](https://arxiv.org/html/2606.26120#A4)
6. [E Experimental Supplements](https://arxiv.org/html/2606.26120#A5)
## Appendix A Algorithm Supplement
The pseudocode of Dynamic\-dLLM is shown in Algorithms [1](https://arxiv.org/html/2606.26120#alg1) and [2](https://arxiv.org/html/2606.26120#alg2)\.
Algorithm 1 Dynamic Cache Updating
- 1: Input: mask predictor $f_{\theta}$, prompt $\mathbf{c}$, initial masked sequence $\mathbf{x}^{T}$ with length $L$, denoising steps $T$, cache\-update budgets $B_{\text{window}}$ and $B_{\text{layer}}$, initial threshold $\tau^{T}$\.
- 2: Output: final prediction $\mathbf{x}^{0}$\.
- $\triangleright$ Initialize caches at step $t=T$
- 3: $C \leftarrow \text{InitializeCache}(L, \mathbf{x}^{T})$ $\triangleright$ Cache Key, Value, Attention Output and FFN Output of $L$ tokens\.
- 4: Generate prediction $\hat{\mathbf{x}}^{T}$ using model $f_{\theta}$ $\triangleright$ Needs initial pass or separate handling
- 5: $\mathbf{x}^{T-1} \leftarrow \mathcal{S}(\hat{\mathbf{x}}^{T}, \mathbf{x}^{T}, T)$
- 6: for $t = T-1$ down to $1$ do
  - 7: $\mathbf{x}_{\text{layer\_in}} \leftarrow \mathbf{x}^{t}$ $\triangleright$ Initial input for layer $0$ at step $t$
  - 8: for $l$ in each layer in the Transformer network do
    - 9: $\mathbf{x}^{t,l} \leftarrow \mathrm{LayerNorm}(\mathbf{x}_{\text{layer\_in}}^{l})$
    - 10: $\mathcal{U}^{t,l} \leftarrow \emptyset$
    - 11: $\mathcal{U}^{t,l} \leftarrow \mathcal{U}^{t,l} \cup \left\{ x_{i} \mid p - \frac{B_{\text{window}}}{2} \leq i \leq p + \frac{B_{\text{window}}}{2} \right\}$ $\triangleright$ $p$ is the position of the token unmasked in the last step
    - 12: for each token $j$ in sequence do
      - 13: $d_{j}^{t,l} = 1 - \frac{(\mathbf{x}_{j}^{t,l})^{\top} \mathbf{x}_{j}^{t+1,l}}{\|\mathbf{x}_{j}^{t,l}\| \, \|\mathbf{x}_{j}^{t+1,l}\|}$
    - 14: end for
    - 15: $s^{t,l} \leftarrow \text{Mean}\left(d^{t,l}_{0}, d^{t,l}_{1}, \ldots, d^{t,l}_{L-1}\right)$
    - 16: $B^{t,l}_{\text{layer}} \leftarrow (B_{\text{layer}} \times \mathit{LayerNum}) \cdot \frac{s^{t+1,l}}{\sum_{k=0}^{\mathit{LayerNum}-1} s^{t+1,k}}$
    - 17: $\mathcal{U}^{t,l} \leftarrow \mathcal{U}^{t,l} \cup$ indices of the $B^{t,l}_{\text{layer}}$ tokens with the highest $d_{j}^{t,l}$
    - 18: $\mathbf{x}_{\text{layer\_out}}, C \leftarrow \text{RefreshCache}(\mathbf{x}_{\text{norm}}^{t}, l, C, \mathcal{U}^{t,l})$
    - 19: $\mathbf{x}_{\text{layer\_in}} \leftarrow \mathbf{x}_{\text{layer\_out}}$ $\triangleright$ Update input for the next layer
  - 20: end for $\triangleright$ End layer loop
  - 21: Generate prediction $\mathbf{z}^{t}$ using final layer output $\mathbf{x}_{\text{layer\_out}}$
  - 22: $\mathbf{x}^{t-1}, \tau^{t-1} \leftarrow \text{ParallelDecoding}(\mathbf{z}^{t}, \mathbf{x}^{t}, \tau^{t})$ $\triangleright$ Adaptive Parallel Decoding, shown in Algorithm [2](https://arxiv.org/html/2606.26120#alg2)
  - 23: if all $\mathbf{x}^{t-1}$ unmasked then
    - 24: break
  - 25: end if
- 26: end for $\triangleright$ End step loop
- 27: return final prediction $\mathbf{x}^{0}$
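The token\-selection part of Algorithm 1 \(lines 11, 13, 16, and 17\) can be sketched in plain Python as follows\. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, hidden states are plain lists of floats, budgets are rounded to integers, and a uniform split is used when all layer scores are zero\. Note that, as in line 16, the per\-layer budget is computed from the scores of the previous step\.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors (line 13)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 0.0


def allocate_layer_budgets(layer_scores, b_layer):
    """Split a total budget of b_layer * num_layers across layers in
    proportion to each layer's mean token change (line 16).

    Assumption: budgets are rounded to integers, so their sum can differ
    slightly from b_layer * num_layers.
    """
    num_layers = len(layer_scores)
    total = sum(layer_scores)
    if total == 0:
        return [b_layer] * num_layers  # assumption: uniform fallback
    return [round(b_layer * num_layers * s / total) for s in layer_scores]


def select_update_set(distances, last_unmasked_pos, b_window, layer_budget):
    """Union of the local window around the last unmasked position (line 11)
    and the layer_budget tokens with the highest distance (line 17)."""
    seq_len = len(distances)
    half = b_window // 2
    lo = max(0, last_unmasked_pos - half)
    hi = min(seq_len - 1, last_unmasked_pos + half)
    update = set(range(lo, hi + 1))
    ranked = sorted(range(seq_len), key=lambda j: distances[j], reverse=True)
    update.update(ranked[:layer_budget])
    return update


# Toy example: three layers, the middle one changed most at the previous step.
budgets = allocate_layer_budgets([0.1, 0.3, 0.1], b_layer=2)
print(budgets)  # [1, 4, 1]

# Per-token distances for an 8-token sequence at one layer.
dists = [0.0, 0.05, 0.9, 0.01, 0.02, 0.0, 0.7, 0.03]
chosen = select_update_set(dists, last_unmasked_pos=4, b_window=2, layer_budget=2)
print(sorted(chosen))  # [2, 3, 4, 5, 6]
```

In the toy run, positions 3 to 5 come from the local window and positions 2 and 6 from the top\-distance selection; only these tokens would have their cached features refreshed at that layer\.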
Algorithm 2 Adaptive Parallel Decoding
- 1: Input: prediction $\mathbf{z}^{t}$, parameters $\alpha, \beta$, initial threshold $\tau^{T}_{n}$ for every masked token $n$, masked sequence $\mathbf{x}^{t}$
- 2: $\mathbf{p}^{t} \leftarrow \text{Softmax}(\mathbf{z}^{t})$ $\triangleright$ Probability distribution over $\mathcal{V}$
- 3: for $n = 0$ to $L-1$ do
  - 4: $\mathbf{p}_{\text{sorted}} \leftarrow \text{Sort}(\mathbf{p}_{n}^{t}, \text{descending})$
  - 5: $c_{n}^{t} \leftarrow 1 - \mathbf{p}_{\text{sorted}}[1]$ $\triangleright$ Calculate peak concentration
- 6: end for $\triangleright$ End token loop
- $\triangleright$ Calculate confidence fluctuation \(for adjacent steps\)
- 7: if $t < T-1$ then
  - 8: for $i = 0$ to $L-1$ do
    - 9: $H_{i}^{t} \leftarrow 1 - \frac{(\mathbf{z}_{i}^{t})^{\top} \mathbf{z}_{i}^{t+1}}{\|\mathbf{z}_{i}^{t}\| \, \|\mathbf{z}_{i}^{t+1}\|}$ $\triangleright$ One minus cosine similarity
  - 10: end for
- 11: end if
- $\triangleright$ Adaptive threshold adjustment
- 12: if $t < T-1$ then
  - 13: $\tau^{t}_{i} = \tau^{t+1}_{i} - \alpha \cdot c_{i}^{t} + \beta \cdot H_{i}^{t}$
- 14: end if
- 15: Unmask all $i$ with $p_{i}^{t} \geq \tau^{t}$, always unmask $p_{i}^{t}$
- 16: return $\mathbf{x}^{t-1}$
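One step of Algorithm 2 can be sketched in plain Python as below\. This is a hedged illustration rather than the authors' code: the function names are hypothetical, $\mathbf{p}_{\text{sorted}}[1]$ is interpreted as the largest probability, each token is compared against its own threshold, and the "always unmask" clause in line 15 is interpreted as a fallback to the single most confident masked token \(an assumption\)\.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 0.0


def adaptive_parallel_decode(logits, prev_logits, prev_thresholds, masked,
                             alpha=0.001, beta=0.0008):
    """One decoding step.

    logits / prev_logits: per-position logit vectors at steps t and t+1
                          (prev_logits is None when no earlier step exists).
    prev_thresholds:      per-position thresholds tau^{t+1}.
    masked:               set of positions that are still masked.
    Returns (positions to unmask, updated per-position thresholds).
    """
    thresholds = list(prev_thresholds)
    confidences = {}
    for i in masked:
        confidences[i] = max(softmax(logits[i]))
        if prev_logits is not None:
            c = 1.0 - confidences[i]                        # peak concentration term
            h = cosine_distance(logits[i], prev_logits[i])  # fluctuation term
            thresholds[i] = prev_thresholds[i] - alpha * c + beta * h
    unmask = {i for i in masked if confidences[i] >= thresholds[i]}
    if not unmask and masked:
        # Assumption: fall back to the most confident token so that every
        # step makes progress.
        unmask = {max(masked, key=lambda i: confidences[i])}
    return unmask, thresholds


# Toy example: three masked positions over a 3-word vocabulary.
logits = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3], [0.0, 3.0, 0.0]]
unmask, tau = adaptive_parallel_decode(
    logits, prev_logits=None, prev_thresholds=[0.9, 0.9, 0.9], masked={0, 1, 2}
)
print(sorted(unmask))  # [0, 2]
```

With stable logits across two steps the fluctuation term is close to zero, so the threshold of an uncertain token drifts downward by roughly $\alpha \cdot c_{i}^{t}$ per step, which is how a token whose confidence stays below the initial threshold can eventually be decoded\.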
## Appendix B Experiment Details
### B\.1 Benchmarks and Settings
Table [4](https://arxiv.org/html/2606.26120#A2.T4) shows the specific setup for each benchmark, including the number of decoding steps, block length, and generation length\. The benchmarks encompass MMLU \(5\-shot\), ARC\-C \(0\-shot\), GSM8K \(4\-shot\), Math \(4\-shot\), and HumanEval \(0\-shot\)\. To examine the generalization and robustness of the different approaches, we limit task\-dependent hyperparameter tuning and use a uniform setup for all benchmarks except HumanEval\. Owing to the nature of its task, HumanEval requires more decoding steps and a longer generation length\.
### B\.2 Implementation Details
We describe the parameter setups for the compared methods dKV\-Cache and dLLM\-Cache across the evaluated models\. Following the configurations suggested in the dKV\-Cache paper, we set the cache update interval to 8 for the LLaDA series and to 4 for the Dream series\.
For dLLM\-Cache, the paper presents multiple parameter configurations, where $K_{p}$ denotes the prompt refresh interval and $K_{r}$ the response refresh interval\. Specifically, we use $K_{p}=50$, $K_{r}=7$ for LLaDA\-8B\-Instruct; $K_{p}=100$, $K_{r}=6$ for LLaDA\-1\.5; and $K_{p}=50$, $K_{r}=2$ for Dream\-v0\-7B\-Instruct\.
In addition, for Adaptive Parallel Decoding \(APD\) in Dynamic\-dLLM, we set $\alpha=0.001$ and $\beta=0.0008$ based on extensive statistical analysis\.
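For reference, the settings listed in this subsection can be collected into one configuration fragment\. The variable and key names below are illustrative only and are not taken from any of the cited codebases; the numeric values are those stated in the text\.

```python
# Baseline and APD settings quoted in this subsection.
# All variable and key names are hypothetical.

# dKV-Cache: cache update interval per model family.
DKV_CACHE_UPDATE_INTERVAL = {"LLaDA": 8, "Dream": 4}

# dLLM-Cache: prompt refresh interval K_p and response refresh interval K_r.
DLLM_CACHE_REFRESH = {
    "LLaDA-8B-Instruct": {"K_p": 50, "K_r": 7},
    "LLaDA-1.5": {"K_p": 100, "K_r": 6},
    "Dream-v0-7B-Instruct": {"K_p": 50, "K_r": 2},
}

# Dynamic-dLLM: Adaptive Parallel Decoding coefficients.
APD_PARAMS = {"alpha": 0.001, "beta": 0.0008}
```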
Table 4: Configuration of Benchmarks
## Appendix C Example Description
As shown in Figure [7](https://arxiv.org/html/2606.26120#A3.F7), in the absence of candidates, a fixed threshold can hinder the early decoding of correct predictions, whereas Adaptive Parallel Decoding monitors the status of candidates in real time and ends unnecessary steps early\.
Figure 7: A fixed threshold may hinder the early output of correct predictions\. The correctly predicted “good” cannot be decoded until its confidence exceeds the threshold $0.8$, which wastes $n$ steps\.
## Appendix D Related Work
Large Language Models\.Driven by transformer\-based architectures and large\-scale pretraining, Large Language Models \(LLMs\) have achieved remarkable success, demonstrating exceptional capabilities\(Tianet al\.,[2020](https://arxiv.org/html/2606.26120#bib.bib27); Laiet al\.,[2021](https://arxiv.org/html/2606.26120#bib.bib28); Jianget al\.,[2021](https://arxiv.org/html/2606.26120#bib.bib31); Penget al\.,[2023](https://arxiv.org/html/2606.26120#bib.bib32); Cuiet al\.,[2023](https://arxiv.org/html/2606.26120#bib.bib35); Luoet al\.,[2023](https://arxiv.org/html/2606.26120#bib.bib37); Shaoet al\.,[2024](https://arxiv.org/html/2606.26120#bib.bib42); Penget al\.,[2024a](https://arxiv.org/html/2606.26120#bib.bib45)\)\. While these models primarily excel in textual processing\(Tianet al\.,[2022](https://arxiv.org/html/2606.26120#bib.bib40);[2023](https://arxiv.org/html/2606.26120#bib.bib41); Ninget al\.,[2023](https://arxiv.org/html/2606.26120#bib.bib47); Wanget al\.,[2024](https://arxiv.org/html/2606.26120#bib.bib46)\), their robust architectural foundation has paved the way for various functional extensions\(Cuiet al\.,[2022](https://arxiv.org/html/2606.26120#bib.bib30);[2023](https://arxiv.org/html/2606.26120#bib.bib35); Penget al\.,[2024b](https://arxiv.org/html/2606.26120#bib.bib39); Yanget al\.,[2024](https://arxiv.org/html/2606.26120#bib.bib43); Wanget al\.,[2024](https://arxiv.org/html/2606.26120#bib.bib46)\), such as semantic segmentation and object detection\(Yanget al\.,[2024](https://arxiv.org/html/2606.26120#bib.bib43);[2025](https://arxiv.org/html/2606.26120#bib.bib36); Laiet al\.,[2024b](https://arxiv.org/html/2606.26120#bib.bib33); Penget al\.,[2025](https://arxiv.org/html/2606.26120#bib.bib48); Wanget al\.,[2025b](https://arxiv.org/html/2606.26120#bib.bib50); Huanget al\.,[2025a](https://arxiv.org/html/2606.26120#bib.bib51)\)\. 
One notable derivative branch is the development of Multimodal Large Language Models \(MLLMs\)\(Wanget al\.,[2025a](https://arxiv.org/html/2606.26120#bib.bib49); Liet al\.,[2025](https://arxiv.org/html/2606.26120#bib.bib44); Zhanget al\.,[2025](https://arxiv.org/html/2606.26120#bib.bib53); Huanget al\.,[2025b](https://arxiv.org/html/2606.26120#bib.bib54)\), which expand the core LLM utility by integrating modality encoders to process inputs beyond text, such as image, audio, and video\. Through this extension, the reasoning power of the central LLM backbone has been adapted to assist in other domains, including traditional computer vision tasks, as illustrated by specific applications like LISA\(Laiet al\.,[2024a](https://arxiv.org/html/2606.26120#bib.bib26)\)and LISA\+\+\(Yanget al\.,[2023](https://arxiv.org/html/2606.26120#bib.bib38)\)in reasoning segmentation\.
Diffusion Large Language Models\. Diffusion models, which excel in continuous data generation through iterative denoising processes \(Sohl\-Dickstein et al\., [2015](https://arxiv.org/html/2606.26120#bib.bib19); Ho et al\., [2020](https://arxiv.org/html/2606.26120#bib.bib20)\), have recently shown promising potential in natural language processing\. In contrast to their success in image domains \(Rombach et al\., [2022](https://arxiv.org/html/2606.26120#bib.bib22); Peebles and Xie, [2023b](https://arxiv.org/html/2606.26120#bib.bib23)\), adapting these models to text generation faces fundamental challenges arising from the discrete token space and sequential dependencies inherent in language\. A key advancement in addressing these issues comes from discrete diffusion frameworks, particularly Masked Diffusion Models \(MDMs\) that operate by progressively refining sequences through context\-aware mask prediction \(Austin et al\., [2021](https://arxiv.org/html/2606.26120#bib.bib24); Lou et al\., [2023](https://arxiv.org/html/2606.26120#bib.bib25)\)\. Recent methodological innovations have significantly expanded the capabilities of diffusion\-based language models\. Scaling efforts have produced foundation models like LLaDA \(Nie et al\., [2025](https://arxiv.org/html/2606.26120#bib.bib4)\), an 8B\-parameter bidirectional Transformer trained from scratch, and Dream \(Ye et al\., [2025](https://arxiv.org/html/2606.26120#bib.bib10)\), which leverages pre\-trained autoregressive weights, both achieving performance parity with similarly sized autoregressive models\.
Inference acceleration methods of dLLMs\. Multiple studies have investigated strategies for speeding up discrete diffusion large language models \(dLLMs\)\. Feature\-caching approaches cut computational cost by caching the internal features of tokens across diffusion steps\. dLLM\-Cache \(Liu et al\., [2025b](https://arxiv.org/html/2606.26120#bib.bib6)\) selects a fixed proportion of tokens for cache update in each layer by sorting the cosine similarity of the Value vectors between adjacent steps\. dKV\-Cache \(Ma et al\., [2025](https://arxiv.org/html/2606.26120#bib.bib8)\) puts the tokens decoded at each step into the cache and does not update them in later steps\. Fast\-dLLM \(Wu et al\., [2025](https://arxiv.org/html/2606.26120#bib.bib7)\) caches tokens outside the current block and updates tokens within it\. What these methods have in common is that they adopt the same cache\-update strategy for all layers, which ignores the different requirements of each layer\. In addition, Fast\-dLLM proposes a parallel decoding strategy that unmasks tokens whose confidence exceeds a predetermined threshold, which makes it difficult to balance quality and efficiency\.
## Appendix E Experimental Supplements
### E\.1 Stability Analysis of Hyperparameters
Stability Analysis of $B_{\text{layer}}$ and $B_{\text{window}}$\. To investigate the stability of parameter settings across diverse scenarios, we conducted additional experiments\. As shown in Table [5](https://arxiv.org/html/2606.26120#A5.T5), when the sequence length exceeds 1k, the accuracy of the default settings \($B_{\text{layer}}=32$, $B_{\text{window}}=32$\) declines slightly\. However, with appropriate increases in these two parameters, the accuracy gradually recovers\. Furthermore, when the sum of $B_{\text{layer}}$ and $B_{\text{window}}$ is fixed, different proportional allocations between them affect performance differently\. These experiments confirm that setting $B_{\text{layer}}$ equal to $B_{\text{window}}$ yields the best results\. To ensure the method’s stability across different generation lengths, we propose an auto\-tuning strategy: $B_{\text{layer}}=B_{\text{window}}=\frac{1}{8}\times \text{gen\_len}$ \(where $\text{gen\_len}$ denotes the generation length\)\.
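The auto\-tuning rule above amounts to a one\-line computation\. The helper below is a minimal sketch with a hypothetical name; using integer division and a lower bound of 1 are assumptions for generation lengths that are not multiples of 8 or are very short\.

```python
from typing import Tuple


def auto_budgets(gen_len: int) -> Tuple[int, int]:
    """Return (B_layer, B_window), each set to gen_len / 8.

    Assumptions: integer (floor) division, and a minimum budget of 1.
    """
    budget = max(1, gen_len // 8)
    return budget, budget


for gen_len in (256, 512, 1024):
    print(gen_len, auto_budgets(gen_len))
# 256 (32, 32)
# 512 (64, 64)
# 1024 (128, 128)
```

For a generation length of 256 the rule gives 32 for both budgets, which coincides with the default value of 32 quoted above\.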
Stability Analysis ofα\\alphaandβ\\beta\.For the hyperparametersα=0\.001\\alpha=0\.001andβ=0\.0008\\beta=0\.0008, we performed experiments under different output length settings \(256, 512, and 1024\), while maintainingBwindow=18×gen\_lenB\_\{window\}=\\frac\{1\}\{8\}\\times gen\\\_lenandBlayer=18×gen\_lenB\_\{layer\}=\\frac\{1\}\{8\}\\times gen\\\_len\(see Table[6](https://arxiv.org/html/2606.26120#A5.T6)\)\. The results indicate that there is almost no degradation in accuracy across these different generation lengths, verifying the strong stability ofα\\alphaandβ\\beta\.
Table 5:Performance of DCU with different settings ofBwindowB\_\{\{window\}\}andBlayerB\_\{\{layer\}\}, using 1024 generated tokens on the GSM8K dataset with the LLaDA\-8B\-Instruct model\.Table 6:Performance scores under different generation lengths withα=0\.001,β=0\.0008\\alpha=0\.001,\\beta=0\.0008As per Equation[10](https://arxiv.org/html/2606.26120#S3.E10),α\\alphaandβ\\betaregulate threshold adaptation to prediction confidence and temporal instability, respectively\. We conducted ablation experiments on GSM8K \(gen\_lengen\\\_len=256\), with results in Table[7](https://arxiv.org/html/2606.26120#A5.T7)\(score=accuracy; NIS=Number of Inference Steps, fewer = faster\)\. The optimal range forα/β\\alpha/\\betais the10−310^\{\-3\}order of magnitude\. The model is sensitive to order\-of\-magnitude scaling \(severe quality loss with over\-scaling\) but stable to small variations within this range\.
Table 7: Performance scores and Number of Inference Steps (NIS) under different combinations of $\alpha$ and $\beta$, with a generation length of 256 on GSM8K.
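An ablation of this kind, which varies $\alpha$ and $\beta$ by orders of magnitude around their defaults, could be set up with a grid like the one below. This is only a sketch: the function `log_scale_grid` and the chosen exponents are hypothetical, and the actual combinations evaluated in Table 7 may differ.

```python
import itertools

# Default values reported in the paper.
DEFAULT_ALPHA = 0.001
DEFAULT_BETA = 0.0008


def log_scale_grid(alpha=DEFAULT_ALPHA, beta=DEFAULT_BETA, exponents=(-1, 0, 1)):
    """Build (alpha, beta) pairs scaled by powers of ten around the defaults.

    Hypothetical helper: the exponents are illustrative and are not the
    exact settings used in the paper's ablation.
    """
    alphas = [alpha * 10.0 ** e for e in exponents]
    betas = [beta * 10.0 ** e for e in exponents]
    # Cartesian product: every scaled alpha paired with every scaled beta.
    return list(itertools.product(alphas, betas))


if __name__ == "__main__":
    for a, b in log_scale_grid():
        print(f"alpha={a:.1e}, beta={b:.1e}")
```

Each pair would then be passed to an evaluation run that records accuracy and NIS, which is outside the scope of this sketch.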
### E.2 Analysis of Additional Latency
We present comprehensive metrics in Table [8](https://arxiv.org/html/2606.26120#A5.T8), which reports the performance scores (accuracy), the average single-inference latency, and the additional latency introduced by DCU and APD separately, all evaluated on the GSM8K dataset across different generation lengths ($gen\_len$: 256, 512, 1024). As shown in Table [8](https://arxiv.org/html/2606.26120#A5.T8), both DCU and APD introduce minimal additional latency. This overhead exhibits a clear linear scaling trend with generation length and accounts for only a small portion of the inference latency, far less than the gain it brings.
Table 8: Performance metrics (scores and time-related data) of baseline, DCU, and APD under different generation lengths ($gen\_len$). Time values are rounded to two decimal places.
### E.3 Performance on Low-resource Hardware
To address low-resource hardware deployment, we tested our method with the LLaDA-8B-Instruct model on the GSM8K and HumanEval datasets using an RTX 4090 GPU. As shown in Table [9](https://arxiv.org/html/2606.26120#A5.T9), our method still maintains a strong speed-accuracy trade-off under such cost-constrained settings, demonstrating its practicality for deployment on non-high-end hardware.
Table 9: Performance scores and inference speed (TPS) of different methods on RTX 4090.