Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

arXiv cs.CL 06/10/26, 04:00 AM Papers
Summary
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
arXiv:2606.10537v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
Cached at: 06/10/26, 06:11 AM
# Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models
Source: [https://arxiv.org/html/2606.10537](https://arxiv.org/html/2606.10537)
Jing Xiong¹, Qi Han¹, Shansan Gong¹, Yunta Hsieh², Chengyue Wu¹, Chaofan Tao¹, Chenyang Zhao³, Ngai Wong¹

¹The University of Hong Kong, ²University of Michigan, Ann Arbor, ³LMSYS Org

###### Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a *training-free* prefill-decode disaggregation framework for dLLMs that partitions the prefix into $N$ chunks, caches their KV representations once, and selects the top-$K$ most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1–28.0× speedup at 8K–32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Our code is available at [https://github.com/menik1126/Prefilling-dLLM](https://github.com/menik1126/Prefilling-dLLM).


## 1 Introduction

Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) models, offering the ability to generate multiple tokens in parallel through iterative denoising (Nie et al., [2025](https://arxiv.org/html/2606.10537#bib.bib25); Ye et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib24); Sahoo et al., [2024](https://arxiv.org/html/2606.10537#bib.bib56); Austin et al., [2021](https://arxiv.org/html/2606.10537#bib.bib52)). Unlike AR models that produce tokens sequentially from left to right, dLLMs corrupt and reconstruct entire sequences simultaneously, enabling flexible generation orders and potentially faster inference (Wu et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib22); Wang et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib44)). However, this paradigm introduces a critical inefficiency in long-context scenarios: the entire input prefix must participate in every denoising step, even though its representation remains largely unchanged across iterations.

In autoregressive LLM serving, the *prefill-decode disaggregation* architecture (Zhong et al., [2024](https://arxiv.org/html/2606.10537#bib.bib1)) assigns the prefill and decode phases to separate GPU clusters, exploiting their distinct computational profiles (prefill is compute-bound while decode is memory-bound) to maximize hardware utilization and serving throughput. In contrast, dLLM inference is fundamentally compute-bound throughout: since the entire sequence (prefix + decode) must be jointly processed at every denoising step, each iteration performs a full forward pass over the combined sequence, making the workload dominated by matrix multiplications rather than *memory bandwidth*. This compute-bound nature persists across all denoising iterations, unlike AR decoding where only a single new token is appended per step. Recent work on dLLM acceleration has explored KV caching strategies (Ma et al., [2026](https://arxiv.org/html/2606.10537#bib.bib5); Liu et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib23); Nguyen-Tri et al., [2025](https://arxiv.org/html/2606.10537#bib.bib30)) and sparse attention mechanisms (Wang et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib43); Song et al., [2025](https://arxiv.org/html/2606.10537#bib.bib28); Jiang et al., [2025](https://arxiv.org/html/2606.10537#bib.bib31)), yet none explores disaggregating the prefill and decode stages to avoid repeated long-context computation across denoising iterations.

Our key insight is that in long-context dLLM inference, the input prefix is redundantly processed at every denoising iteration, yet attention from response tokens to the prefix exhibits strong locality bias that intensifies across steps, and only a small fraction of prefix tokens are actively attended to. Motivated by this observation, we present Prefilling-dLLM (Prefilling for diffusion LLMs), which computes the prefix KV cache once in a dedicated prefill stage and reuses it across all decode steps without recomputation. Specifically, we partition the prefix into $N$ fixed-size chunks of size $C$ with intra-chunk attention, reducing prefill complexity from $O(L_p^2)$ to $O(N \cdot C^2)$ and enabling parallel processing across devices. During decode, we retain only a small subset of relevant chunks via retrieval-augmented generation (Jiang et al., [2024](https://arxiv.org/html/2606.10537#bib.bib41); Lai et al., [2025](https://arxiv.org/html/2606.10537#bib.bib42); Xu et al., [2025](https://arxiv.org/html/2606.10537#bib.bib48); Yuan et al., [2025](https://arxiv.org/html/2606.10537#bib.bib49)), reducing complexity from $O((L_p + L_d)^2 \cdot T)$ to $O(N \cdot C^2 + (L_d^2 + K \cdot C) \cdot T)$, where $K$ is the number of selected chunks and $T$ is the number of denoising steps.

We evaluate Prefilling-dLLM on LongBench and InfiniteBench, achieving 9.1–28.0× speedup at 8K–32K contexts with state-of-the-art quality among dLLM acceleration methods. Our contributions are as follows:

- We propose Prefilling-dLLM, a *training-free prefill-decode disaggregation* framework for dLLMs. By prefilling the prefix KV cache once and sharing it across all denoising iterations, we eliminate recomputation and achieve significant speedups that scale with context length.
- We introduce *sparse prefilling* that selects relevant chunks and tokens, reducing complexity from $O((L_p + L_d)^2 \cdot T)$ to $O(N \cdot C^2 + (L_d^2 + K \cdot C) \cdot T)$. Combined with an optimized attention kernel that parallelizes decoding over the cached chunk KV, this yields up to 28× end-to-end speedup at 32K.
- We show that BOS tokens prepended to each chunk act as periodic attention anchors, mitigating the lost-in-the-middle phenomenon in dLLMs without introducing attention sinks.

## 2 Related Work

### 2.1 Diffusion Language Models

Diffusion models have been extended from continuous domains to discrete text generation through various formulations. Early work explored continuous diffusion over word embeddings (Li et al., [2022](https://arxiv.org/html/2606.10537#bib.bib55); Gong et al., [2022](https://arxiv.org/html/2606.10537#bib.bib54)) and masked diffusion over discrete tokens (Austin et al., [2021](https://arxiv.org/html/2606.10537#bib.bib52); He et al., [2023](https://arxiv.org/html/2606.10537#bib.bib53); Sahoo et al., [2024](https://arxiv.org/html/2606.10537#bib.bib56)). More recently, masked discrete diffusion has been scaled to large language models (Gong et al., [2025](https://arxiv.org/html/2606.10537#bib.bib18)): LLaDA (Nie et al., [2025](https://arxiv.org/html/2606.10537#bib.bib25)) demonstrated that masked diffusion can match autoregressive models at the 8B parameter scale, while Dream (Ye et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib24)) and MDLM (Sahoo et al., [2024](https://arxiv.org/html/2606.10537#bib.bib56)) further validated the effectiveness of this paradigm. Subsequent efforts have focused on scaling (Bie et al., [2025](https://arxiv.org/html/2606.10537#bib.bib17); Gong et al., [2025](https://arxiv.org/html/2606.10537#bib.bib18)), preference alignment (Zhu et al., [2025](https://arxiv.org/html/2606.10537#bib.bib50)), and extending dLLMs to long contexts (Liu et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib34); He et al., [2025](https://arxiv.org/html/2606.10537#bib.bib35)) and multimodal settings (You et al., [2025](https://arxiv.org/html/2606.10537#bib.bib14)). Despite these advances, the efficiency of dLLMs in long-context scenarios remains underexplored.

### 2.2 Efficient Inference for dLLMs

In autoregressive LLMs, sparse attention methods such as MInference (Jiang et al., [2024](https://arxiv.org/html/2606.10537#bib.bib41)), DCA (An et al., [2024](https://arxiv.org/html/2606.10537#bib.bib73)), FlexPrefill (Lai et al., [2025](https://arxiv.org/html/2606.10537#bib.bib42)), XAttention (Xu et al., [2025](https://arxiv.org/html/2606.10537#bib.bib48)) and NSA (Yuan et al., [2025](https://arxiv.org/html/2606.10537#bib.bib49)) reduce long-context attention cost via adaptive or block-sparse patterns, while StreamingLLM (Xiao et al., [2024](https://arxiv.org/html/2606.10537#bib.bib10)), H2O (Zhang et al., [2023](https://arxiv.org/html/2606.10537#bib.bib37)), and SnapKV (Li et al., [2024](https://arxiv.org/html/2606.10537#bib.bib38)) compress the KV cache by retaining only important entries. However, these techniques target causal attention where a KV cache is naturally built during left-to-right generation, and do not directly apply to the bidirectional attention in dLLMs where no such cache exists. For dLLMs, Fast-dLLM (Wu et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib22)) and Fast-dLLM v2 (Wu et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib51)) introduce KV caching across denoising steps by reusing key-value representations from previous iterations. dKV-Cache (Ma et al., [2026](https://arxiv.org/html/2606.10537#bib.bib5)) proposes adaptive caching that selectively updates KV entries based on token confidence. SparseD (Wang et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib43)), Sparse-dLLM (Song et al., [2025](https://arxiv.org/html/2606.10537#bib.bib28)), d2Cache (Jiang et al., [2025](https://arxiv.org/html/2606.10537#bib.bib31)), Focus-dLLM (Long et al., [2026](https://arxiv.org/html/2606.10537#bib.bib72)) and LoSA (Xi et al., [2026](https://arxiv.org/html/2606.10537#bib.bib71)) exploit inherent attention sparsity for dynamic cache eviction. However, all these methods operate within the standard inference loop where the entire sequence is processed at every step. Our work instead disaggregates the prefix computation from iterative decoding at the system level, and applies sparse chunk retrieval over a static prefix KV cache.

### 2.3 Prefill-Decode Disaggregation

In autoregressive LLM serving, prefill is compute-bound while decode is memory-bound. DistServe (Zhong et al., [2024](https://arxiv.org/html/2606.10537#bib.bib1)) exploits this asymmetry by assigning the two phases to separate GPU clusters. Mooncake (Qin et al., [2024](https://arxiv.org/html/2606.10537#bib.bib2)) transfers KV caches between prefill and decode nodes via a distributed cache pool, SPAD (Zhang et al., [2025](https://arxiv.org/html/2606.10537#bib.bib3)) designs specialized hardware for each phase, and Semi-PD (Hong et al., [2025](https://arxiv.org/html/2606.10537#bib.bib4)) introduces a hybrid approach with disaggregated computation and unified storage. This principle has not been applied to dLLMs, where every denoising step performs a full forward pass over the entire sequence, making inference compute-bound throughout. Our work bridges this gap by computing the prefix KV cache once and reusing it across all denoising iterations, and further analyzes the potential memory bottleneck introduced by caching.

## 3 Preliminary: Masked Diffusion Models

Masked diffusion language models (dLLMs) define a forward noising process (Sahoo et al., [2024](https://arxiv.org/html/2606.10537#bib.bib56); Gong et al., [2025](https://arxiv.org/html/2606.10537#bib.bib18); Ye et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib24)) that progressively corrupts a discrete token sequence $\mathbf{x}_0 = (x_1, \ldots, x_L)$ by replacing tokens with a special [MASK] token. At each diffusion timestep $t \in [0, 1]$, each token is independently masked with probability $t$, yielding a noised sequence $\mathbf{x}_t$. The reverse (denoising) process is parameterized by a neural network $p_\theta(\mathbf{x}_0 \mid \mathbf{x}_t)$ that predicts the original clean tokens given the partially masked input.

During training, the model is optimized to minimize the cross-entropy loss over masked positions:

$$\mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t}\left[-\sum_{i:\, x_t^i = \texttt{[M]}} \log p_\theta(x_0^i \mid \mathbf{x}_t)\right]. \tag{1}$$

During inference, the model starts from a fully masked sequence and iteratively unmasks tokens over $T$ denoising steps. At each step, the model predicts all masked positions simultaneously, and a subset of high-confidence predictions are unmasked according to a scheduling strategy. This parallel decoding enables dLLMs to generate multiple tokens per step, but at each step the model performs full self-attention over the entire sequence (prefix + response), resulting in computational cost that scales with the total length at every iteration.
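
The forward masking process and the masked cross-entropy of Eq. (1) can be illustrated with a small sketch. It is not the authors' training code: the helper names `mask_tokens` and `masked_nll` and the toy predictor are hypothetical, and a real model would supply the per-position probabilities.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(x0, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]

def masked_nll(x0, xt, probs):
    """Sum of -log p(x0_i | x_t) over the positions that are masked in x_t.

    probs[i] maps candidate tokens to predicted probabilities at position i.
    """
    return -sum(math.log(probs[i][x0[i]])
                for i in range(len(x0)) if xt[i] == MASK)

rng = random.Random(0)
x0 = ["the", "cat", "sat", "down"]
xt = mask_tokens(x0, t=0.5, rng=rng)

# Toy predictor (assumption): probability 0.5 on the true token everywhere.
probs = [{tok: 0.5} for tok in x0]
loss = masked_nll(x0, xt, probs)
print(xt, loss)
```

Unmasked positions contribute nothing to the loss, which matches the restriction of the sum in Eq. (1) to masked indices.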

## 4 Motivation

### 4.1 Lost-in-the-Middle in dLLMs

Autoregressive LLMs suffer from the “lost-in-the-middle” phenomenon (Liu et al., [2024](https://arxiv.org/html/2606.10537#bib.bib65)), where retrieval accuracy drops for information placed in the middle of the context. We evaluate whether dLLMs share this bias using a position-controlled multi-document QA task and make three key observations: (i) Within the native training range (256–2K tokens) and YaRN×2 extrapolation (4K), Dream-7B achieves perfect accuracy at all positions; (ii) Further extrapolation (8K, 16K, 32K) introduces emerging positional sensitivity (Figure [1](https://arxiv.org/html/2606.10537#S4.F1)), with accuracy skewing toward positions closer to the response, unlike the U-shaped curve in AR LLMs where both the beginning and end are favored; (iii) In dLLMs, bidirectional attention produces a monotonic decay: tokens near the response receive strong attention regardless of their absolute position, while distant tokens are uniformly neglected. This locality-driven degradation motivates our chunk-based selective retrieval strategy.

![Refer to caption](https://arxiv.org/html/2606.10537v1/x1.png)

Figure 1: Lost-in-the-Middle evaluation on Dream-7B (training length = 2K). Context extrapolation via YaRN scaling. Native range (256–2K) and YaRN×2 (4K) achieve EM = 1.0 across all positions. YaRN×4 (8K), YaRN×8 (16K), and YaRN×16 (32K) show increasing degradation. Each position is evaluated with 30 samples; 10 evenly spaced positions per context length.

### 4.2 Locality of Attention Decay

![Refer to caption](https://arxiv.org/html/2606.10537v1/x2.png)

Figure 2: Attention weight decay as a function of distance from response tokens to prefix tokens, measured at different denoising steps. Attention decays rapidly with distance, exhibiting strong locality bias. This decay becomes more pronounced in later denoising steps as token predictions stabilize.

We further analyze the attention patterns of Dream-7B during denoising to understand how response tokens attend to the prefix. We measure the average attention weight from response tokens to prefix tokens as a function of distance (number of tokens separating them). We observe three key findings (Figure [2](https://arxiv.org/html/2606.10537#S4.F2)): (i) Attention weights decay rapidly with distance, with response tokens concentrating most of their attention mass on nearby prefix tokens; (ii) The decay becomes more pronounced as denoising progresses and token predictions stabilize, suggesting that full attention over the entire prefix is largely redundant in later steps; (iii) Beyond the overall decay trend, attention exhibits sparse, quasi-periodic spikes at specific prefix positions that correspond to salient tokens (e.g., segment boundaries); the dominant spike concentrates 25% of the attention mass, is stable across all denoising steps, and is consistent across layers 5–27.

This locality and sparsity pattern directly motivates our Prefilling-dLLM design: since distant prefix tokens contribute negligibly to response generation and a small number of chunks capture the majority of useful attention signal, we can cache the prefix KV once with parallel chunk processing and selectively retrieve only relevant chunks during decoding, achieving significant speedups.

![Refer to caption](https://arxiv.org/html/2606.10537v1/x3.png)

Figure 3: Overview of Prefilling-dLLM. (I) Prefill: The prefix is partitioned into $N$ chunks, each independently prefilled with intra-chunk attention to produce per-chunk KV caches; chunks are ranked by a predictive score combining self-information and pseudo-label logits; the top-$K$ chunks are selected, and only the top-$B$ query-relevant tokens per chunk are retained in the KV cache. (II) Sparse Attention: During decoding, only the selected chunks’ KV caches participate in cross-attention with the response tokens. (III) Decoding: Iterative denoising progressively unmasks the response over $T$ steps, reusing the cached KV without recomputation.

## 5 Method

We present Prefilling-dLLM, a two-stage framework that disaggregates prefix computation from iterative denoising. Instead of re-encoding the prefix at every denoising step, we process it once in a prefill stage and cache its KV representations for reuse during decoding. This reduces computational complexity from $O((L_p + L_d)^2 \cdot T)$ to $O(N \cdot C^2 + (L_d^2 + K \cdot C) \cdot T)$, where $N = \lceil L_p / C \rceil$ is the number of chunks and $C$ is the chunk size. For fixed $C$ and $K$, the prefill cost scales linearly with prefix length, while the per-step decoding cost is independent of it.
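
To make the two complexity expressions concrete, the following sketch evaluates them as plain operation-count proxies. The function names are hypothetical, and the setting is an assumption: $C = 1024$, $K = 4$ and $L_d = 32$ echo values quoted in the efficiency analysis, while $T = 32$ is chosen purely for illustration. The resulting ratio compares attention-cost proxies only and should not be read as a wall-clock speedup.

```python
import math

def dense_cost(L_p: int, L_d: int, T: int) -> int:
    """Proxy for vanilla decoding: (L_p + L_d)^2 per step, over T steps."""
    return (L_p + L_d) ** 2 * T

def prefilling_cost(L_p: int, L_d: int, T: int, C: int, K: int) -> int:
    """Proxy from the text: N * C^2 one-off prefill + (L_d^2 + K * C) per step."""
    N = math.ceil(L_p / C)
    return N * C ** 2 + (L_d ** 2 + K * C) * T

# Hypothetical 32K-context setting; T is an illustrative assumption.
L_p, L_d, T, C, K = 32 * 1024, 32, 32, 1024, 4
N = math.ceil(L_p / C)
ratio = dense_cost(L_p, L_d, T) / prefilling_cost(L_p, L_d, T, C, K)
print(f"N = {N}, proxy cost ratio = {ratio:.1f}")
```

Doubling `L_p` doubles `N` and therefore the prefill term, while the per-step term `(L_d ** 2 + K * C)` stays unchanged.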

### 5.1 Prefill

Motivated by the attention decay observed in Figure [2](https://arxiv.org/html/2606.10537#S4.F2), we note that response tokens need not attend to all $N$ chunks, and chunks need not attend to each other. Instead, we propose a predictive prefill strategy that independently processes each chunk, scores its relevance to the query, and selects only the top-$K$ informative chunks for decoding.

#### Chunk Prefill\.

We partition the prefix into $N = \lceil L_p / C \rceil$ non-overlapping chunks $\{\mathbf{c}_1, \ldots, \mathbf{c}_N\}$, each of size $C$ tokens. A special BOS token is prepended to each chunk as a delimiter. We obtain *pseudo-labels* $\mathbf{m}$ by running iterative denoising over the query with each chunk to produce an initial response estimate; these pseudo-labels guide chunk scoring without requiring ground-truth targets. For each chunk, we concatenate it with the query tokens $\mathbf{q}$ and $\mathbf{m}$ to form the input $[\mathbf{c}_i; \mathbf{q}; \mathbf{m}]$, and perform a forward pass. This yields per-chunk KV caches:

$$\mathbf{K}_i, \mathbf{V}_i = \text{IntraAttn}(\mathbf{c}_i), \quad \mathbf{K}_i, \mathbf{V}_i \in \mathbb{R}^{H \times C \times d} \tag{2}$$

where $H$ is the number of attention heads and $d$ is the head dimension. Since chunks are independent, they can be processed in parallel across devices. The prefill complexity is $O(N \cdot C^2)$.
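
The partitioning step can be sketched as follows. This is a minimal illustration, assuming the BOS delimiter is added on top of the $C$ content tokens; the `BOS` placeholder and the function name `partition_prefix` are hypothetical, and a real implementation would use the tokenizer's BOS id.

```python
import math

BOS = "<bos>"  # placeholder symbol (assumption); depends on the tokenizer

def partition_prefix(prefix, C):
    """Split the prefix into N = ceil(L_p / C) non-overlapping chunks,
    each with a BOS delimiter prepended."""
    N = math.ceil(len(prefix) / C)
    return [[BOS] + prefix[i * C:(i + 1) * C] for i in range(N)]

chunks = partition_prefix(list(range(10)), C=4)
for chunk in chunks:
    print(chunk)
```

The last chunk may be shorter than $C$ when $L_p$ is not a multiple of the chunk size; how the original implementation pads or handles it is not stated in the text.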

#### Predictive Score\.

We score each chunk using two complementary signals obtained during prefill to evaluate its relevance as an *inter-chunk sparsity estimator*. First, we compute the Self-Information Score as the negative log-likelihood of the query window $\mathbf{q}$ conditioned on the chunk:

$$s_{\text{I}}(\mathbf{c}_i) = -\frac{1}{|\mathbf{q}|} \sum_{j=1}^{|\mathbf{q}|} \log p_\theta(q_j \mid \mathbf{c}_i, \mathbf{q}_{<j}) \tag{3}$$

A lower NLL indicates that the chunk provides more information relevant to the query. Second, we compute the Pseudo-Label Score using the pseudo-labels $\mathbf{m}$ obtained during prefill. We evaluate how well each chunk predicts these pseudo-labels:

$$s_{\text{P}}(\mathbf{c}_i) = -\frac{1}{|\mathbf{m}|} \sum_{j=1}^{|\mathbf{m}|} \log p_\theta(m_j \mid \mathbf{c}_i, \mathbf{q}) \tag{4}$$

where $\mathbf{m}$ denotes the pseudo-labels obtained from a preliminary diffusion generation.

### 5.2 Sparse Attention

Our framework introduces sparsity at two levels.

#### Intra\-chunk sparsity\.

During prefill, each chunk performs bidirectional self-attention only within itself, avoiding the quadratic cost of full-prefix attention. The query tokens participate in bidirectional attention with each chunk and serve as a proxy to evict irrelevant tokens from the chunk’s KV cache, retaining only the most informative entries for decoding. Specifically, for each token $c_j$ in chunk $\mathbf{c}_i$, we compute its eviction score as the cumulative bidirectional attention weight between the token and the query:

$$e(c_j) = \sum_{k=1}^{|\mathbf{q}|} \text{Attn}(q_k, c_j) + \sum_{k=1}^{|\mathbf{q}|} \text{Attn}(c_j, q_k) \tag{5}$$

Tokens are ranked by eviction score and only the top-$B$ tokens per chunk are retained in the KV cache, maintaining a fixed budget while preserving query-relevant information.
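
The eviction score of Eq. (5) and the top-$B$ retention can be sketched on a toy attention matrix. This is an illustration under stated assumptions: a single head and layer, with `attn[a][b]` denoting the weight from position `a` to position `b` over the concatenation [chunk; query]. How scores are aggregated across heads and layers is not specified in the text, and the function names are hypothetical.

```python
def eviction_scores(attn, n_chunk):
    """e(c_j) = sum_k attn[q_k][c_j] + sum_k attn[c_j][q_k] for each chunk token.

    Positions 0..n_chunk-1 are chunk tokens; the remaining ones are query tokens.
    """
    q_pos = range(n_chunk, len(attn))
    return [sum(attn[k][j] for k in q_pos) + sum(attn[j][k] for k in q_pos)
            for j in range(n_chunk)]

def keep_top_b(scores, B):
    """Indices of the B highest-scoring tokens, returned in original order."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:B]
    return sorted(top)

# Toy example: 3 chunk tokens followed by 2 query tokens (rows sum to 1).
attn = [
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.30, 0.10, 0.25, 0.25],
    [0.20, 0.20, 0.40, 0.10, 0.10],
    [0.10, 0.50, 0.10, 0.20, 0.10],
    [0.05, 0.60, 0.05, 0.10, 0.20],
]
scores = eviction_scores(attn, n_chunk=3)
kept = keep_top_b(scores, B=2)
print(scores, kept)
```

Returning the kept indices in their original order preserves the positional layout of the retained KV entries.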

#### Inter\-chunk sparsity\.

We rank chunks by the combined score $s(\mathbf{c}_i) = s_{\text{I}}(\mathbf{c}_i) + s_{\text{P}}(\mathbf{c}_i)$ and retain only the top-$K$ chunks ($K \ll N$). During decoding, only these $K$ chunks participate in attention, so the response tokens attend to $K \cdot B$ prefix tokens rather than the full prefix of length $L_p$, significantly reducing the per-step computation.
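
Chunk scoring and selection can be sketched from per-token log-probabilities. One point is an assumption here rather than something the text states outright: because both terms in Eqs. (3) and (4) are negative log-likelihoods and a lower NLL is described as more relevant, this sketch treats "top-$K$" as the $K$ chunks with the lowest combined score. The function names and toy numbers are hypothetical.

```python
import math

def mean_nll(log_probs):
    """Average negative log-likelihood of a list of per-token log-probabilities."""
    return -sum(log_probs) / len(log_probs)

def select_chunks(query_lp, pseudo_lp, K):
    """query_lp[i] / pseudo_lp[i]: log-probs of the query / pseudo-label tokens
    obtained with chunk i as context.

    Returns (selected chunk indices in prefix order, combined scores).
    Assumption: lower combined NLL means a more relevant chunk.
    """
    scores = [mean_nll(q) + mean_nll(m) for q, m in zip(query_lp, pseudo_lp)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:K]), scores

# Toy log-probabilities for N = 3 chunks.
query_lp = [[math.log(0.1)] * 2, [math.log(0.5)] * 2, [math.log(0.2)] * 2]
pseudo_lp = [[math.log(0.2)] * 3, [math.log(0.6)] * 3, [math.log(0.3)] * 3]
selected, scores = select_chunks(query_lp, pseudo_lp, K=2)
print(selected, [round(s, 3) for s in scores])
```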

### 5.3 Decoding

#### Prefix Reuse\.

The KV cache of the selected $K$ chunks is fixed after prefill and remains static across all $T$ denoising steps. At each denoising step, the query and response tokens are jointly processed to produce KV representations, which are then concatenated with the cached KV of the selected chunks. The query tokens are recomputed at each step as they participate in bidirectional attention with the denoised response. This yields a per-step cost of $O(L_d^2)$ instead of $O((L_p + L_d)^2)$.

#### Iterative Denoising\.

Starting from a fully masked response sequence, the model iteratively unmasks tokens over $T$ denoising steps. At each step, the model predicts all remaining masked positions simultaneously, and tokens whose confidence exceeds a threshold $\tau$ are unmasked. As denoising progresses, the number of masked tokens decreases monotonically until the full response is revealed.
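
The confidence-thresholded unmasking loop can be sketched with a stand-in predictor. Two things are assumptions for illustration: `toy_predict` replaces the dLLM forward pass with random confidences, and the fallback that commits the single most confident token when nothing exceeds $\tau$ is a common safeguard that the text does not describe.

```python
import random

MASK = None
_rng = random.Random(0)

def toy_predict(seq):
    """Stand-in for the dLLM: predicts 'tok<i>' with a random confidence
    at every masked position."""
    return {i: (f"tok{i}", _rng.random())
            for i, tok in enumerate(seq) if tok is MASK}

def denoise(length, predict, tau, T):
    """Iteratively unmask a fully masked response over at most T steps."""
    seq = [MASK] * length
    for _ in range(T):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        preds = predict(seq)
        accepted = [i for i in masked if preds[i][1] > tau]
        if not accepted:
            # Assumption: commit the most confident token so every step progresses.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            seq[i] = preds[i][0]
    return seq

out = denoise(length=8, predict=toy_predict, tau=0.9, T=8)
print(out)
```

Because each step unmasks at least one token, a response of length 8 is fully revealed within 8 steps in this sketch.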

## 6 Experiments

### 6.1 Setup

#### Benchmarks\.

We evaluate Prefilling-dLLM on two long-context benchmarks: LongBench (Bai et al., [2024](https://arxiv.org/html/2606.10537#bib.bib58)), which covers a set of tasks including single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion with context lengths ranging from 2K to 32K tokens; and InfiniteBench (Zhang et al., [2024](https://arxiv.org/html/2606.10537#bib.bib62)), which extends to contexts exceeding 100K tokens with tasks such as long-document retrieval, book-level QA, and mathematical reasoning.

### 6.2 Main Results

Intra\-chunk sparsity can improve performance in dLLMs, contrary to the performance drop observed in AR LLMs under sparse attention\.

Table 1: Performance comparison on LongBench. Bold indicates the best performance among acceleration methods. In sparse variants, we retain the top-$B$ highest-attention tokens per chunk, with $B = 512$ in our experiments.

#### LongBench.

Table [1](https://arxiv.org/html/2606.10537#S6.T1) presents the performance comparison on LongBench. We highlight several observations: (i) On Dream-7B, Ours (inter + intra-sparsity) reaches the best average score among acceleration methods (34.59), with a large gain on RB-P over both Vanilla and the strongest competing acceleration baseline on this task (57.98 vs. 41.99 and 29.23); on UltraLLaDA, it reaches an average score of 37.02, exceeding Sparse-dLLM (36.68), dKV-Cache (36.29), and Fast-dLLM (35.98); (ii) On UltraLLaDA, the two Prefilling-dLLM variants jointly obtain the best results on 9 out of 16 subtasks, with gains on context-sensitive tasks such as MF-en (39.94 vs. 37.31) and RB-P (62.67 vs. 54.97), demonstrating that inter-chunk sparsity effectively identifies relevant context; (iii) Compared with inter-only sparsity, adding intra-chunk sparsity improves the average score from 22.01 to 34.59 on Dream-7B and from 35.64 to 37.02 on UltraLLaDA while further reducing computation.

#### InfiniteBench\.

We evaluate Prefilling-dLLM on InfiniteBench with contexts exceeding 128K tokens. Results are presented in Table [2](https://arxiv.org/html/2606.10537#S6.T2). On Dream-7B, Ours (inter + intra-sparsity) achieves 43.62 average accuracy, surpassing the strongest baseline Fast-dLLM v2 (30.32) by over 13 points, with particularly strong gains on Passkey (95.42) and Number retrieval (70.00), further confirming that sparsity improves performance, all without any additional training.

Table 2: Performance comparison on InfiniteBench. Accuracy is reported as a percentage. Bold indicates the best performance among acceleration methods. In sparse variants, we retain the top-$B$ highest-attention tokens per chunk, with $B = 512$ in our experiments.

### 6.3 Efficiency Analysis

We show that Prefilling-dLLM scales sub-linearly with length via fixed-size chunk selection, surpassing the strongest baseline Sparse-dLLM at 16K and 32K.

![Refer to caption](https://arxiv.org/html/2606.10537v1/x4.png)

Figure 4: Throughput comparison (tokens/s) on LongBench MF-en at varying context lengths (Dream-7B, bf16, GQA with 32 query heads, 8 KV heads, head dim 128, single A800 GPU, 32 generated tokens, 5 measured samples). Labels above bars show speedup relative to the Transformers baseline.

As shown in Figure [4](https://arxiv.org/html/2606.10537#S6.F4), we highlight several key observations: (i) Prefilling-dLLM achieves increasing speedups as context grows (9.1× at 8K, 16.1× at 16K, 28.0× at 32K), because it compresses the context to a fixed budget (top-4 chunks × 1024 tokens ≈ 4K) regardless of input length, while all baselines must process the full context at every denoising step; (ii) Sparse-dLLM is fastest at 8K (16.62 tok/s) through aggressive token eviction, but degrades rapidly at longer contexts (3.51 tok/s at 32K) because its eviction ratio is fixed; (iii) In contrast, Prefilling-dLLM surpasses Sparse-dLLM at both 16K and 32K, demonstrating that retrieval-augmented generation provides better quality (Table [1](https://arxiv.org/html/2606.10537#S6.T1)) and superior scaling efficiency.

#### Attention Kernel Comparison\.

Loading the entire cached prefix KV at every denoising step creates a memory bottleneck. We adopt Split-S FlexAttention to address this, achieving up to 10.2× speedup over vanilla FlexAttention.

We benchmark attention kernel options for the two phases of PD-separated dLLM inference. For prefilling, we compare Flash Attention (Dao et al., [2022](https://arxiv.org/html/2606.10537#bib.bib63)), FlexAttention (Dong et al., [2024](https://arxiv.org/html/2606.10537#bib.bib66)), xFormers Attention (Rabe and Staats, [2021](https://arxiv.org/html/2606.10537#bib.bib36)), and FlashInfer (Ye et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib68)). As shown in Figure [5](https://arxiv.org/html/2606.10537#S6.F5)(a), FlashInfer and Flash Attention achieve the lowest prefilling latency, while FlexAttention adds 1.4–1.5× overhead from block mask evaluation and xFormers is 1.6–1.8× slower.

For decoding under PD separation, each denoising step computes attention with query length $L_d$ against KV length $L_p + L_d$, creating a highly asymmetric pattern ($L_d \ll L_p$). FlexAttention parallelizes only along the query dimension, severely underutilizing the GPU. We apply a Split-S decomposition that operates directly on the $S$ non-contiguously stored chunk KV caches from prefilling, computes attention independently per chunk, and merges partial results via log-sum-exp reduction, avoiding the need to gather chunk KV into contiguous memory and achieving 5.8–10.2× speedup (Figure [5](https://arxiv.org/html/2606.10537#S6.F5)b).
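
The log-sum-exp merge behind a split-KV decomposition can be checked numerically. The sketch below is a pure-Python, single-head illustration of the general technique, not the Split-S FlexAttention kernel itself; the function names are hypothetical. Each chunk yields a locally normalized output plus the log-sum-exp of its scores, and the partial outputs are recombined with weights $\exp(\text{lse}_i - \text{lse}_{\text{total}})$.

```python
import math
import random

def partial_attention(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (locally normalized output, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    out = [sum(e * v[j] for e, v in zip(exps, values)) / z
           for j in range(len(values[0]))]
    return out, m + math.log(z)

def merge_partials(partials):
    """Combine per-chunk outputs using weights exp(lse_i - lse_total)."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    total = m + math.log(sum(math.exp(l - m) for l in lses))
    dim = len(partials[0][0])
    return [sum(math.exp(lse - total) * out[j] for out, lse in partials)
            for j in range(dim)]

# Toy check: S = 3 separately stored chunks versus one contiguous KV buffer.
rng = random.Random(0)
d = 4

def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

q = rand_vec()
chunks = [([rand_vec() for _ in range(n)], [rand_vec() for _ in range(n)])
          for n in (5, 3, 7)]
merged = merge_partials([partial_attention(q, k, v) for k, v in chunks])
all_k = [k for ks, _ in chunks for k in ks]
all_v = [v for _, vs in chunks for v in vs]
full, _ = partial_attention(q, all_k, all_v)
print(max(abs(a - b) for a, b in zip(merged, full)))
```

Because the merge only needs each chunk's output and one scalar per query, the chunk KV buffers never have to be copied into a single contiguous tensor.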

Finding 1: *Under PD separation, chunked prefilling naturally partitions the prefix KV cache into independently addressable segments. Reusing these independent KV segments for Split-S parallel decoding yields 5.8–10.2× latency reduction, turning the prefill-stage partitioning into a direct decode-stage speedup.*

![Refer to caption](https://arxiv.org/html/2606.10537v1/x5.png)

Figure 5: Attention kernel benchmark (bf16, GQA 32/8 heads, single A800). (a) Prefilling: FlashInfer achieves the lowest latency; Flash Attention is 1.0–1.2× slower; FlexAttention adds 1.4–1.5× overhead; xFormers is 1.6–1.8× slower. Labels show relative slowdown vs. FlashInfer. (b) Decoding (PD separation, $L_d = 32$, $S = 4$ splits): Split-S FlexAttention partitions the KV dimension into $S$ chunks and processes them in parallel via the batch dimension.

### 6.4 Lost-in-the-Middle

As shown in Section [1](https://arxiv.org/html/2606.10537#S4.F1), dLLMs exhibit positional sensitivity under context extrapolation, with retrieval accuracy degrading for information placed in the middle of long contexts. We investigate whether Prefilling-dLLM mitigates this effect by evaluating on the same position-controlled multi-document QA task (Liu et al., [2024](https://arxiv.org/html/2606.10537#bib.bib65)). Since Prefilling-dLLM selects the most relevant chunks via predictive scoring rather than relying on positional proximity, we hypothesize that it can attend to informative tokens regardless of their position in the prefix.

![Refer to caption](https://arxiv.org/html/2606.10537v1/x6.png)

Figure 6: Lost-in-the-Middle evaluation on Dream-7B (training length = 2K). We compare Prefilling-dLLM (solid) against Vanilla Dream (dashed) with YaRN extrapolation across 4K–32K contexts, measuring exact-match accuracy as a function of gold document position. Prefilling-dLLM maintains consistently high EM across all positions and context lengths, while Vanilla collapses at 16K and 32K.

Finding 2: *Inter-chunk sparsity eliminates the lost-in-the-middle phenomenon in dLLMs, enabling position-invariant needle retrieval across all context lengths.*

The periodic attention spikes that cause positional bias in Vanilla inference become the signal that Prefilling-dLLM leverages for position-invariant chunk retrieval, transforming catastrophic failure into mild degradation at 32K.

### 6.5 Attention Sink Analysis

Do the periodic attention spikes from chunk-level BOS tokens degenerate into attention sinks (Xiao et al., [2024](https://arxiv.org/html/2606.10537#bib.bib10))? We investigate whether they absorb disproportionate attention mass and bias chunk selection toward positional artifacts.

We analyze the attention patterns during generation for both Prefilling-dLLM and Vanilla Dream (YaRN ×4, 8K context), measuring the fraction of attention mass absorbed by the first-1 token, first-5 tokens, and all BOS tokens across all 28 layers. Both conditions use the full 8K context without chunk selection: Vanilla Dream processes the flat token sequence, while Prefilling-dLLM segments it into 8 chunks of 1024 tokens with a BOS delimiter prepended to each chunk.
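As a rough illustration of this measurement, the sketch below computes the share of attention mass that falls on the first token, the first five tokens, and every chunk BOS position for a single attention row. It assumes each segment consists of one BOS token followed by the chunk tokens, so BOS positions recur every `segment_len` entries; the helper names and the toy uniform row are hypothetical, and the averaging over layers, heads, and queries used in the paper is left out.

```python
def attention_mass(attn_row, positions):
    """Fraction of the row's total attention that lands on `positions`."""
    total = sum(attn_row)
    return sum(attn_row[p] for p in positions) / total

def sink_fractions(attn_row, segment_len):
    """Attention share of the first-1 token, first-5 tokens, and all chunk BOS tokens.

    Assumption: every segment is [BOS] + chunk tokens, so BOS tokens sit at
    indices 0, segment_len, 2 * segment_len, ...
    """
    bos_positions = range(0, len(attn_row), segment_len)
    return {
        "first_1": attention_mass(attn_row, range(1)),
        "first_5": attention_mass(attn_row, range(5)),
        "all_bos": attention_mass(attn_row, bos_positions),
    }

# Toy example: 8 segments of 4 entries each with perfectly uniform attention.
uniform_row = [1.0] * 32
fractions = sink_fractions(uniform_row, segment_len=4)
print(fractions)
```

Under uniform attention the BOS share simply equals the fraction of positions that are BOS tokens, which gives a baseline against which a measured share can be compared when judging whether BOS tokens act as sinks.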

![Refer to caption](https://arxiv.org/html/2606.10537v1/x7.png)

Figure 7: Attention sink analysis (8K context, YaRN ×4). Left: Per-layer attention ratio absorbed by the first-1, first-5, and all chunk BOS tokens; both methods stay below 1% on average, with the BOS token absorbing only 0.59% (Vanilla) and 0.30% (Prefilling-dLLM). Right: Attention profile at layer 14 (log scale); green dashes mark chunk BOS positions, showing periodic attention spikes that distribute mass uniformly.

Finding 3: *The position-invariant retrieval in Finding 2 is enabled by chunk-level BOS tokens, which act as periodic attention anchors that distribute attention mass across the context rather than degenerating into attention sinks.*

We highlight several observations from Figure [7](https://arxiv.org/html/2606.10537#S6.F7) and Figure [8](https://arxiv.org/html/2606.10537#S6.F8): (i) unlike AR LLMs, where the BOS token absorbs 20–60% of attention mass (Xiao et al., [2024](https://arxiv.org/html/2606.10537#bib.bib10)), the first token in dLLMs absorbs only 0.59% (Vanilla) and 0.30% (Prefilling-dLLM) on average across layers; (ii) even when summing over all 9 chunk BOS tokens, the total BOS attention in Prefilling-dLLM is only 2.72%, confirming that chunk BOS tokens serve as segment delimiters without becoming parasitic attention sinks; (iii) as shown in Figure [8](https://arxiv.org/html/2606.10537#S6.F8), Vanilla Dream develops a mild ridge only at the sequence start, while Prefilling-dLLM exhibits periodic ridges at chunk BOS positions that remain stable throughout denoising without growing into dominant peaks.

![Refer to caption](https://arxiv.org/html/2606.10537v1/x8.png)

![Refer to caption](https://arxiv.org/html/2606.10537v1/x9.png)

Figure 8: Attention landscape during denoising (layer 14, log-scale). Left: Vanilla Dream shows a flat landscape with mild elevation at the sequence start. Right: Prefilling-dLLM exhibits periodic ridges (cyan lines) at chunk BOS positions, serving as stable attention anchors without forming dominant sinks.

Finding 4: *Chunk BOS tokens form stable, low-magnitude attention ridges throughout denoising, acting as distributed anchors rather than accumulating into dominant sinks.*

### 6.6 Effect of Chunk Size on Prefilling

![Refer to caption](https://arxiv.org/html/2606.10537v1/x10.png)

Figure 9: Effect of chunk size on Lost-in-the-Middle retrieval accuracy. We fix the total token budget (top-$k$ × chunk_size = 4096) and vary the chunk granularity across 8K–128K contexts. The base model is Vanilla Dream-7B with a 2K training length; all longer contexts require extrapolation. Smaller chunks (256–512) maintain >90% EM even at 128K (×64 extrapolation), while large chunks (4096, top-1) degrade sharply beyond 32K.

We investigate how chunk size affects downstream task performance under a fixed token budget. Specifically, we keep the total number of selected tokens constant at 4096 (i.e., top-$k$ × chunk_size = 4096) and vary the chunk granularity across 256, 512, 1024, 2048, and 4096 tokens. Figure [9](https://arxiv.org/html/2606.10537#S6.F9) shows the results on the Lost-in-the-Middle benchmark at 8K, 16K, 32K, 64K, and 128K context lengths.

The results reveal a clear trade-off that intensifies with context length. At 8K, smaller chunks (256 tokens, top-16) achieve the highest accuracy (96.0% EM). As context grows to 16K–64K, chunk size 1024 (top-4) becomes optimal (92.0%, 92.7%, and 95.3% EM for 16K, 32K, and 64K respectively). At 128K (×64 extrapolation), finer granularity becomes essential: chunk size 256 (top-16) achieves 92.3% EM, while chunk size 1024 drops to 83.0%. Overly large chunks (4096 tokens, top-1) degrade sharply beyond 32K (47.0% at 64K, 68.0% at 128K), confirming that multi-chunk prefilling is essential for long-context dLLM inference.

Finding 5: *Chunk size creates a quality–efficiency trade-off: smaller chunks improve retrieval accuracy but underutilize GPU compute. Multi-chunk selection is essential, as single-chunk prefilling fails at all long contexts.*
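The fixed-budget setup can be made concrete with a few lines of arithmetic. The sketch below derives the top-$k$ value implied by each chunk size under the 4096-token budget and the fraction of prefix chunks that survive selection at each context length. It assumes that "8K" through "128K" mean multiples of 1024 tokens and ignores the instruction prefix, query suffix, and BOS tokens; the helper names are hypothetical.

```python
import math

TOKEN_BUDGET = 4096                          # fixed budget from the text
CHUNK_SIZES = [256, 512, 1024, 2048, 4096]   # granularities from the text
# Assumption: "8K" ... "128K" are read as multiples of 1024 tokens.
CONTEXT_LENGTHS = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024]

def top_k_for(chunk_size, budget=TOKEN_BUDGET):
    """Number of selected chunks so that top_k * chunk_size == budget."""
    assert budget % chunk_size == 0, "chunk size must divide the budget"
    return budget // chunk_size

def selected_fraction(context_len, chunk_size, budget=TOKEN_BUDGET):
    """Fraction of prefix chunks that survive selection."""
    num_chunks = math.ceil(context_len / chunk_size)
    return min(1.0, top_k_for(chunk_size, budget) / num_chunks)

configs = {c: top_k_for(c) for c in CHUNK_SIZES}
print(configs)
for ctx in CONTEXT_LENGTHS:
    row = ", ".join(f"C={c}: {selected_fraction(ctx, c):.3f}" for c in CHUNK_SIZES)
    print(f"{ctx // 1024:>3}K  {row}")
```

Since the budget is fixed, every chunk size keeps the same share of the prefix at a given context length; what changes is how many separate regions that share can cover, from 16 regions at chunk size 256 down to a single region at chunk size 4096.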

See Appendix [C](https://arxiv.org/html/2606.10537#A3) for additional ablations.

## 7 Conclusion

We presented Prefilling-dLLM, a prefill-decode disaggregation framework for dLLMs that caches chunked prefix KV once and retrieves the top-$K$ chunks for decoding, achieving 9.1–28.0× speedup at 8K–32K contexts. Our analysis reveals that chunk-level BOS tokens act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon, and that multi-chunk prefilling enables extrapolation to 128K tokens with over 92% exact-match accuracy on retrieval-based QA.

## Limitations

Our chunk selection is static: the top-$K$ chunks are fixed after prefill with no dynamic re-selection during decoding, so inaccurate pseudo-labels may cause relevant context to be missed. The chunk size $C$ and $K$ require task-specific tuning, as smaller chunks improve accuracy but underutilize GPU compute. Additionally, FlexAttention lacks paged memory management, requiring the prefix KV cache to be reloaded at every decoding step. Finally, we evaluate only on Dream-7B and UltraLLaDA with English benchmarks; generalization to other dLLM architectures, larger scales, or multilingual settings remains to be verified.

## Use of AI Assistants

We used AI writing assistants solely for language polishing and proofreading\. All research ideas, experimental design, implementation, and scientific conclusions are entirely the authors’ own work\.

## References

- Training-free long-context scaling of large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021) Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34, pp. 17981–17993. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. (2024) Longbench: a bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119–3137. Cited by: [§6.1](https://arxiv.org/html/2606.10537#S6.SS1.SSS0.Px1.p1.1).
- T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan, C. Li, C. Li, J. Li, Z. Li, H. Liu, L. Liu, G. Lu, X. Lu, Y. Ma, J. Tan, L. Wei, J. Wen, Y. Xing, X. Zhang, J. Zhao, D. Zheng, J. Zhou, J. Zhou, Z. Zhou, L. Zhu, and Y. Zhuang (2025) LLaDA2.0: scaling up diffusion language models to 100b. External Links: 2512.15745, [Link](https://arxiv.org/abs/2512.15745). Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré (2022) FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§6.3](https://arxiv.org/html/2606.10537#S6.SS3.SSS0.Px1.p1.3).
- J. Dong, B. Feng, D. Guessous, Y. Liang, and H. He (2024) Flex Attention: a programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496. External Links: [Link](https://arxiv.org/abs/2412.05496). Cited by: [§6.3](https://arxiv.org/html/2606.10537#S6.SS3.SSS0.Px1.p1.3).
- S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, J. Han, H. Peng, and L. Kong (2025) Scaling diffusion language models via adaptation from autoregressive models. External Links: 2410.17891, [Link](https://arxiv.org/abs/2410.17891). Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1), [§3](https://arxiv.org/html/2606.10537#S3.p1.5).
- S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong (2022) Diffuseq: sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933. Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- G. He, S. Nie, F. Zhu, Y. Zhao, T. Bai, R. Yan, J. Fu, C. Li, and B. Yuan (2025) UltraLLaDA: scaling the context length to 128k for diffusion large language models. External Links: 2510.10481, [Link](https://arxiv.org/abs/2510.10481). Cited by: [Appendix A](https://arxiv.org/html/2606.10537#A1.p1.3), [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.10537#S6.T1.7.1.12.12.1), [Table 2](https://arxiv.org/html/2606.10537#S6.T2.7.1.10.10.1).
- Z. He, T. Sun, Q. Tang, K. Wang, X. Huang, and X. Qiu (2023) Diffusionbert: improving generative masked language models with diffusion models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4521–4534. Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- K. Hong, L. Chen, Z. Wang, X. Li, Q. Mao, J. Ma, C. Xiong, G. Wu, B. Han, G. Dai, et al. (2025) Semi-pd: towards efficient llm serving via phase-wise disaggregated computation and unified storage. arXiv preprint arXiv:2504.19867. Cited by: [§2.3](https://arxiv.org/html/2606.10537#S2.SS3.p1.1).
- H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C. Lin, et al. (2024) Minference 1.0: accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems 37, pp. 52481–52515. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p3.8), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- Y. Jiang, Y. Cai, X. Luo, J. Fu, J. Wang, C. Liu, and X. Yang (2025) D2cache: accelerating diffusion-based llms via dual adaptive caching. External Links: 2509.23094, [Link](https://arxiv.org/abs/2509.23094). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p2.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- X. Lai, J. Lu, Y. Luo, Y. Ma, and X. Zhou (2025) Flexprefill: a context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p3.8), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, pp. 4328–4343. Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen (2024) Snapkv: llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, pp. 22947–22970. Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173. Cited by: [§4.1](https://arxiv.org/html/2606.10537#S4.SS1.p1.1), [§6.4](https://arxiv.org/html/2606.10537#S6.SS4.p1.1).
- X. Liu, Y. Song, Z. Liu, Z. Huang, Q. Guo, Z. He, and X. Qiu (2025a) LongLLaDA: unlocking long context capabilities in diffusion llms. External Links: 2506.14429, [Link](https://arxiv.org/abs/2506.14429). Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- Z. Liu, Y. Yang, Y. Zhang, J. Chen, C. Zou, Q. Wei, S. Wang, and L. Zhang (2025b) DLLM-cache: accelerating diffusion large language models with adaptive caching. External Links: 2506.06295, [Link](https://arxiv.org/abs/2506.06295). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p2.1).
- L. Long, Y. Huang, S. Bai, R. Gong, J. Zhang, A. Zhou, and J. Yang (2026) Focus-dllm: accelerating long-context diffusion llm inference via confidence-guided context focusing. arXiv preprint arXiv:2602.02159. Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- X. Ma, R. Yu, G. Fang, and X. Wang (2026) Dkv-cache: the cache for diffusion language models. Advances in Neural Information Processing Systems 38, pp. 149009–149033. Cited by: [Appendix B](https://arxiv.org/html/2606.10537#A2.p1.1), [§1](https://arxiv.org/html/2606.10537#S1.p2.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- Q. Nguyen-Tri, M. Ranjan, and Z. Shen (2025) Attention is all you need for kv cache in diffusion llms. External Links: 2510.14973, [Link](https://arxiv.org/abs/2510.14973). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p2.1).
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. External Links: 2502.09992, [Link](https://arxiv.org/abs/2502.09992). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- B. Peng, J. Quesnelle, H. Fan, and E. Shao (2023) YaRN: efficient context window extension of large language models. arXiv preprint arXiv:2309.00071. Cited by: [Appendix B](https://arxiv.org/html/2606.10537#A2.p1.1).
- R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, et al. (2024) Mooncake: a kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage. Cited by: [§2.3](https://arxiv.org/html/2606.10537#S2.SS3.p1.1).
- M. N. Rabe and C. Staats (2021) Self-attention does not need $O(n^2)$ memory. arXiv preprint arXiv:2112.05682. Cited by: [§6.3](https://arxiv.org/html/2606.10537#S6.SS3.SSS0.Px1.p1.3).
- S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems 37, pp. 130136–130184. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1), [§3](https://arxiv.org/html/2606.10537#S3.p1.5).
- Y. Song, X. Liu, R. Li, Z. Liu, Z. Huang, Q. Guo, Z. He, and X. Qiu (2025) Sparse-dllm: accelerating diffusion llms with dynamic cache eviction. External Links: 2508.02558, [Link](https://arxiv.org/abs/2508.02558). Cited by: [Appendix B](https://arxiv.org/html/2606.10537#A2.p1.1), [§1](https://arxiv.org/html/2606.10537#S1.p2.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- X. Wang, C. Xu, Y. Jin, J. Jin, H. Zhang, and Z. Deng (2025a) Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. External Links: 2508.09192, [Link](https://arxiv.org/abs/2508.09192). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p1.1).
- Z. Wang, G. Fang, X. Ma, X. Yang, and X. Wang (2025b) SparseD: sparse attention for diffusion language models. arXiv preprint arXiv:2509.24014. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p2.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- C. Wu, H. Zhang, S. Xue, S. Diao, Y. Fu, Z. Liu, P. Molchanov, P. Luo, S. Han, and E. Xie (2025a) Fast-dllm v2: efficient block-diffusion llm. arXiv preprint arXiv:2509.26328. Cited by: [Appendix A](https://arxiv.org/html/2606.10537#A1.p1.3), [Appendix B](https://arxiv.org/html/2606.10537#A2.p1.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2025b) Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. External Links: 2505.22618, [Link](https://arxiv.org/abs/2505.22618). Cited by: [Appendix B](https://arxiv.org/html/2606.10537#A2.p1.1), [§1](https://arxiv.org/html/2606.10537#S1.p1.1), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- H. Xi, H. Singh, Y. Hu, C. Hooper, R. Tiwari, A. Tomar, M. Lee, W. Kang, M. Mahoney, C. Xu, et al. (2026) LoSA: locality aware sparse attention for block-wise diffusion language models. arXiv preprint arXiv:2604.12056. Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=NG7sS51zVF). Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1), [§6.5](https://arxiv.org/html/2606.10537#S6.SS5.p1.1.1), [§6.5](https://arxiv.org/html/2606.10537#S6.SS5.p4.1).
- R. Xu, G. Xiao, H. Huang, J. Guo, and S. Han (2025) XAttention: block sparse attention with antidiagonal scoring. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=KG6aBfGi6e). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p3.8), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025a) Dream 7b: diffusion large language models. External Links: 2508.15487, [Link](https://arxiv.org/abs/2508.15487). Cited by: [Appendix A](https://arxiv.org/html/2606.10537#A1.p1.3), [§1](https://arxiv.org/html/2606.10537#S1.p1.1), [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1), [§3](https://arxiv.org/html/2606.10537#S3.p1.5), [Table 1](https://arxiv.org/html/2606.10537#S6.T1.7.1.3.3.1), [Table 2](https://arxiv.org/html/2606.10537#S6.T2.7.1.2.2.1).
- Z. Ye, L. Chen, R. Lai, W. Lin, Y. Zhang, S. Wang, T. Chen, B. Kasikci, V. Grover, A. Krishnamurthy, et al. (2025b) Flashinfer: efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems 7. Cited by: [§6.3](https://arxiv.org/html/2606.10537#S6.SS3.SSS0.Px1.p1.3).
- Z. You, S. Nie, X. Zhang, J. Hu, J. Zhou, Z. Lu, J. Wen, and C. Li (2025) Llada-v: large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933. Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).
- J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. X. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng (2025) Native sparse attention: hardware-aligned and natively trainable sparse attention. External Links: 2502.11089, [Link](https://arxiv.org/abs/2502.11089). Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p3.8), [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- H. Zhang, P. Patel, A. Ning, and D. Wentzlaff (2025) SPAD: specialized prefill and decode hardware for disaggregated llm inference. arXiv preprint arXiv:2510.08544. Cited by: [§2.3](https://arxiv.org/html/2606.10537#S2.SS3.p1.1).
- X. Zhang, Y. Chen, S. Hu, Z. Xu, J. Chen, M. Hao, X. Han, Z. Thai, S. Wang, Z. Liu, et al. (2024) ∞Bench: Extending long context evaluation beyond 100k tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15262–15277. Cited by: [§6.1](https://arxiv.org/html/2606.10537#S6.SS1.SSS0.Px1.p1.1).
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: [§2.2](https://arxiv.org/html/2606.10537#S2.SS2.p1.1).
- Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang (2024) DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193–210. Cited by: [§1](https://arxiv.org/html/2606.10537#S1.p2.1), [§2.3](https://arxiv.org/html/2606.10537#S2.SS3.p1.1).
- F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y. Lin, J. Wen, and C. Li (2025) LLaDA 1.5: variance-reduced preference optimization for large language diffusion models. External Links: 2505.19223, [Link](https://arxiv.org/abs/2505.19223). Cited by: [§2.1](https://arxiv.org/html/2606.10537#S2.SS1.p1.1).

## Appendix

## Appendix A Implementation Details

We implement Prefilling-dLLM on top of three dLLM *base models*: Dream-7B (Ye et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib24)), UltraLLaDA (He et al., [2025](https://arxiv.org/html/2606.10537#bib.bib35)), and Fast-dLLM v2 (Wu et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib51)). We use $T$ denoising steps and set the chunk size $C$ and the number of chunks $K$ based on validation performance. All experiments are conducted on NVIDIA A100 GPUs.

#### Common Settings\.

All Prefilling-dLLM runs use bfloat16 inference and greedy decoding with temperature 0. During prefill, we prepend a BOS token to every prefix chunk, use causal attention for chunk scoring, and score each chunk independently with the query and optional pseudo-label window. Unless otherwise stated, selected chunks are cached with `full-mask` KV construction, continuous chunk positions, and query positions placed after the selected chunks. For sparse variants with intra-chunk sparsity, we retain the top-$B$ highest-scoring tokens per selected chunk and set $B = 512$ in the main experiments.
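The chunking and selection steps in these settings can be sketched in a few helper functions. This is a minimal illustration on token-ID lists, not the authors' implementation: the chunk and token scores are taken as given inputs (the scoring itself depends on the model), selected chunks and retained tokens are kept in their original order as an assumption, and BOS tokens receive no special treatment during token retention. All function names are hypothetical.

```python
def partition_with_bos(tokens, chunk_size, bos_id):
    """Split the long-context field into chunks and prepend a BOS token to each."""
    return [[bos_id] + tokens[i:i + chunk_size]
            for i in range(0, len(tokens), chunk_size)]

def select_top_k(chunk_scores, k):
    """Indices of the k highest-scoring chunks, returned in original order.

    Assumption: keeping selected chunks in document order is a choice made
    for this sketch.
    """
    ranked = sorted(range(len(chunk_scores)),
                    key=lambda i: chunk_scores[i], reverse=True)
    return sorted(ranked[:k])

def retain_top_b(chunk, token_scores, b):
    """Keep the b highest-scoring tokens of one chunk, preserving their order."""
    ranked = sorted(range(len(chunk)), key=lambda i: token_scores[i], reverse=True)
    keep = sorted(ranked[:b])
    return [chunk[i] for i in keep]

# Toy example with made-up token IDs and scores.
BOS_ID = 0
context_tokens = list(range(100, 110))
chunks = partition_with_bos(context_tokens, chunk_size=4, bos_id=BOS_ID)
selected = select_top_k([0.2, 0.9, 0.5], k=2)
compressed = retain_top_b(chunks[1], [0.1, 0.8, 0.3, 0.9, 0.2], b=2)
print(chunks, selected, compressed)
```

With the settings from the text, `chunk_size` would be 1024 and `b` would be 512; the toy values here are only chosen to keep the example readable.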

#### Prompting and Truncation\.

For LongBench, we use the original task-specific prompt templates provided by the LongBench evaluation configuration, where each prompt is rendered by filling the `{context}` and `{input}` fields. For InfiniteBench, we use the raw benchmark-style prompts for each task, following the same structure of an instruction prefix, the long context, and a task query. In both benchmarks, Prefilling-dLLM separates the rendered prompt into three parts, namely the instruction prefix, the long-context field, and the query suffix. Only the long-context field is partitioned into chunks for chunk scoring, top-$K$ retrieval, and optional top-$B$ token retention; the instruction prefix and query suffix are kept outside the chunk pool and are always included in decoding.

For vanilla and acceleration baselines that cannot process the full prompt within their context window, we apply context-only head-tail truncation: after rendering the same prompt template, we allocate a prompt budget of `max_length` minus the generation length, keep the instruction prefix and query suffix unchanged, and truncate only the long-context field by preserving equal-length head and tail portions while dropping the middle. This avoids removing task instructions or the question. For Dream-7B experiments that use YaRN extrapolation, the effective context window is expanded by the corresponding RoPE scaling factor; for UltraLLaDA, we use its native 128K context window. UltraLLaDA main-table experiments are evaluated without a chat template.
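The context-only head-tail truncation can be sketched as follows. This is a minimal illustration on token-ID lists under stated assumptions: the function names are hypothetical, and when the context budget is odd the extra token is given to the tail, a detail the text does not specify.

```python
def head_tail_truncate(context_tokens, budget):
    """Keep the head and tail of the context and drop the middle."""
    if budget <= 0:
        return []
    if len(context_tokens) <= budget:
        return list(context_tokens)
    head = budget // 2
    tail = budget - head          # assumption: an odd budget gives the tail one extra token
    return context_tokens[:head] + context_tokens[len(context_tokens) - tail:]

def build_prompt(prefix, context, suffix, max_length, gen_length):
    """Truncate only the long-context field; prefix and suffix stay intact."""
    prompt_budget = max_length - gen_length
    context_budget = prompt_budget - len(prefix) - len(suffix)
    return prefix + head_tail_truncate(context, context_budget) + suffix

# Toy example: a 20-token context squeezed into a 16-token window
# with 5 tokens reserved for generation.
prompt = build_prompt(prefix=[1, 2], context=list(range(10, 30)), suffix=[3],
                      max_length=16, gen_length=5)
print(prompt)
```

The prompt budget is split so that the instruction prefix and the question are charged first, and only the remainder is available to the long-context field.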

Prompt example. For LongBench MF-en, the rendered prompt has the following structure:

> Read the following text and answer briefly.
>
> Title: City Archive Renovation The archive reopened after a two-year renovation. The director noted that the new reading room would hold letters, maps, and oral histories from local residents. ... In a later interview, the curator said the first public exhibit would focus on neighborhood transit records from 1912.
>
> Now, answer the following question based on the above text, only give me the answer and do not output any other words.
>
> Question: [question] Answer:

#### Dream\-v0\-Base\-7B\.

Dream-v0-Base-7B has a native 2K context window, so the 128K long-context rows use YaRN extrapolation with rope scale factor 64. For Prefilling-dLLM with inter-chunk sparsity, we use chunk size $C = 1024$ and select top-$K = 4$ chunks. Chunk ranking uses pseudo-label scoring with 4 pseudo-label tokens and one partial denoising round. The main intra-chunk sparsity setting keeps the same chunk-selection configuration and adds top-$B$ token retention with $B = 512$, using the bidirectional query–chunk attention score for token importance.

#### UltraLLaDA\.

UltraLLaDA supports native 128K context length. We evaluate it without a chat template in the main tables. For the inter-chunk sparsity setting, we use $C = 1024$, select top-$K = 2$ chunks, and rank chunks by self-information scoring with a query window of 64 tokens. For the intra-chunk sparsity setting, we use $C = 1024$, select top-$K = 8$ chunks, and retain top-$B = 512$ tokens per chunk with the bidirectional query–chunk attention score.

## Appendix B Baselines

We compare Prefilling-dLLM against standard dLLM inference (full-attention at every denoising step) and dLLM acceleration methods, including Fast-dLLM (Wu et al., [2025b](https://arxiv.org/html/2606.10537#bib.bib22)), Fast-dLLM v2 (Wu et al., [2025a](https://arxiv.org/html/2606.10537#bib.bib51)), dKV-Cache (Ma et al., [2026](https://arxiv.org/html/2606.10537#bib.bib5)), and Sparse-dLLM (Song et al., [2025](https://arxiv.org/html/2606.10537#bib.bib28)). We also include YaRN (Peng et al., [2023](https://arxiv.org/html/2606.10537#bib.bib69)) as a context extrapolation baseline, which extends the native context window of Dream-7B (2K) to 128K via RoPE scaling.

## Appendix C Detailed Ablation Analysis

The ablations show that Prefilling-dLLM benefits most from selecting a small set of informative chunks, using short pseudo-labels for chunk scoring, and retaining a moderate top-$B$ token budget under intra-chunk sparsity.

We provide detailed MF-en ablations in Tables [3](https://arxiv.org/html/2606.10537#A3.T3)–[6](https://arxiv.org/html/2606.10537#A3.T6). We focus on MF-en because it is a retrieval-intensive LongBench task where long-context models must identify a small amount of useful evidence from a large prefix. This makes it a direct probe for both inter-chunk sparsity and intra-chunk sparsity.

#### Ablation on top-$K$.

The optimal number of selected chunks is small, but it depends on whether intra-chunk sparsity is used. Without intra-chunk sparsity, Dream-v0-Base-7B performs best with top-$K = 4$, while UltraLLaDA prefers top-$K = 2$ on the o46 subset. With intra-chunk sparsity, Dream still favors top-$K = 4$, whereas UltraLLaDA improves with top-$K = 8$, suggesting that token-level compression can benefit from a broader candidate chunk pool.

#### Ablation on chunk size\.

The best chunk granularity is model-dependent. Dream-v0-Base-7B favors $C = 1024$, which balances retrieval resolution with enough local context inside each selected chunk. UltraLLaDA, evaluated on the o46 subset, performs best at $C = 2048$ in the top-$K = 4$ sweep, indicating that the native 128K model can benefit from slightly coarser chunks in this setting.

#### Ablation on chunk BOS\.

The chunk-BOS control shows that boundary handling matters, but its effect is not uniform. On UltraLLaDA, keeping the chunk BOS improves MF-en from 28.06 to 29.34 under the top-$K = 4$, $C = 1024$ self-information setting. On Dream-v0-Base-7B, the effect is mixed in the early `chunk-query` route: removing the chunk BOS helps with 16 pseudo-labels, while keeping it is better with 4 pseudo-labels. We therefore keep chunk BOS in the final configuration.

#### Ablation on chunk score\.

Pseudo-label scoring is a stable way to rank chunks before final decoding. On Dream-v0-Base-7B, short pseudo-label windows slightly improve over self-information, while longer pseudo-label windows do not consistently help. On UltraLLaDA, one or two pseudo-labels already improve over self-information on the o46 subset, and increasing the pseudo-label window brings no additional gain.

#### Ablation on partial rounds\.

Partial denoising rounds should be kept short\. Dream\-v0\-Base\-7B benefits from two partial rounds in the tested pseudo\-label settings, but additional rounds reduce the score\. UltraLLaDA shows an even stronger preference for early pseudo\-labels: one partial round is best across the tested pseudo\-label windows, and further rounds degrade performance, suggesting that repeated refinement can inject noise into the chunk\-ranking signal\.

#### Ablation on cache build\.

The cache construction strategy is important for Dream-v0-Base-7B. `full-mask` KV construction substantially outperforms `chunk-query` and `chunk-only`, confirming that selected chunks should be cached under the same masking pattern used by the final decoding stage.

#### Ablation on retained\-token budget\.

When intra-chunk sparsity is enabled, both base models benefit from a larger retained-token budget. Increasing the budget from $B = 256$ to $B = 512$ improves Dream-v0-Base-7B and UltraLLaDA, supporting our choice of a moderate top-$B$ budget that preserves useful local evidence inside each selected chunk.

#### Ablation on token\-retention score\.

For Dream-v0-Base-7B, the bidirectional token-retention score improves over the query-to-chunk score, supporting our design choice of measuring token importance using attention in both directions between query tokens and chunk tokens. Overall, the ablations validate the main configuration used in our experiments: inter-chunk sparsity selects a compact set of relevant chunks, while intra-chunk sparsity preserves the most query-relevant tokens with a fixed top-$B$ budget.

Table 3: Dream\-v0\-Base\-7B MF\-en ablations without intra\-chunk sparsity on the LongBench full split \($n=150$\)\. Each block changes one design variable, named in the block header row\.

| Variant | top-$K$ | $C$ | chunk BOS | chunk score | # pseudo-labels | rounds | KV build | MF-en |
|---|---|---|---|---|---|---|---|---|
| **Ablation on top-$K$** | | | | | | | | |
| top-4 | 4 | 1024 | on | pseudo-label | 4 | – | full-mask | 46.57 |
| top-6 | 6 | 1024 | on | pseudo-label | 4 | – | full-mask | 45.01 |
| **Ablation on chunk size** | | | | | | | | |
| chunk 1024 | 4 | 1024 | on | pseudo-label | 4 | – | full-mask | 41.52 |
| chunk 1900 | 4 | 1900 | on | pseudo-label | 4 | – | full-mask | 37.68 |
| chunk 512 | 4 | 512 | on | pseudo-label | 4 | – | full-mask | 33.14 |
| **Ablation on chunk BOS (chunk-query KV)** | | | | | | | | |
| draft16 BOS on | 4 | 1024 | on | pseudo-label | 16 | – | chunk-query | 39.32 |
| draft16 BOS off | 4 | 1024 | off | pseudo-label | 16 | – | chunk-query | 40.81 |
| draft4 BOS on | 4 | 1024 | on | pseudo-label | 4 | – | chunk-query | 43.39 |
| draft4 BOS off | 4 | 1024 | off | pseudo-label | 4 | – | chunk-query | 42.73 |
| **Ablation on chunk score** | | | | | | | | |
| query-only | 4 | 1024 | on | self-info | 0 | – | full-mask | 46.65 |
| pseudo-2 | 4 | 1024 | on | pseudo-label | 2 | 1 | full-mask | 46.89 |
| pseudo-4 | 4 | 1024 | on | pseudo-label | 4 | 1 | full-mask | 46.57 |
| pseudo-8 | 4 | 1024 | on | pseudo-label | 8 | 1 | full-mask | 46.15 |
| pseudo-16 | 4 | 1024 | on | pseudo-label | 16 | 1 | full-mask | 45.99 |
| pseudo-32 | 4 | 1024 | on | pseudo-label | 32 | 1 | full-mask | 46.68 |
| **Ablation on partial rounds (2 pseudo-labels)** | | | | | | | | |
| round 1 | 4 | 1024 | on | pseudo-label | 2 | 1 | full-mask | 46.48 |
| round 2 | 4 | 1024 | on | pseudo-label | 2 | 2 | full-mask | 46.89 |
| **Ablation on partial rounds (4 pseudo-labels)** | | | | | | | | |
| round 2 | 4 | 1024 | on | pseudo-label | 4 | 2 | full-mask | 47.54 |
| round 3 | 4 | 1024 | on | pseudo-label | 4 | 3 | full-mask | 46.41 |
| round 4 | 4 | 1024 | on | pseudo-label | 4 | 4 | full-mask | 46.57 |
| **Ablation on cache build** | | | | | | | | |
| chunk-query KV | 4 | 1024 | on | pseudo-label | 4 | 2 | chunk-query | 41.52 |
| chunk-only KV | 4 | 1024 | on | pseudo-label | 4 | 2 | chunk-only | 34.13 |
| full-mask KV | 4 | 1024 | on | pseudo-label | 4 | 2 | full-mask | 46.57 |

Table 4: Dream\-v0\-Base\-7B MF\-en ablations with intra\-chunk sparsity on the LongBench full split \($n=150$\)\. Each block changes one design variable, named in the block header row\.

| Variant | top-$K$ | $C$ | chunk score | # pseudo-labels | rounds | KV build | top-$B$ | token score | MF-en |
|---|---|---|---|---|---|---|---|---|---|
| **Ablation on top-$K$** | | | | | | | | | |
| top-6 | 6 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | query-to-chunk | 45.21 |
| top-4 | 4 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | query-to-chunk | 47.54 |
| top-2 | 2 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | query-to-chunk | 46.94 |
| **Ablation on retained-token budget** | | | | | | | | | |
| cap 256 | 4 | 1024 | pseudo-label | 4 | 2 | full-mask | 256 | query-to-chunk | 44.46 |
| cap 512 | 4 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | query-to-chunk | 47.54 |
| **Ablation on token-retention score** | | | | | | | | | |
| query-to-chunk | 4 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | query-to-chunk | 47.54 |
| bidirectional | 4 | 1024 | pseudo-label | 4 | 2 | full-mask | 512 | bidirectional | 47.96 |

Table 5: UltraLLaDA MF\-en ablations without intra\-chunk sparsity on the o46 subset \($n=46$\)\. Each block changes one design variable, named in the block header row\.

| Variant | top-$K$ | $C$ | chunk BOS | chunk score | # pseudo-labels | rounds | KV build | MF-en o46 |
|---|---|---|---|---|---|---|---|---|
| **Ablation on top-$K$ ($C=1024$)** | | | | | | | | |
| top-2 | 2 | 1024 | on | self-info | 0 | – | full-mask | 34.84 |
| top-3 | 3 | 1024 | on | self-info | 0 | – | full-mask | 29.08 |
| top-4 | 4 | 1024 | on | self-info | 0 | – | full-mask | 29.34 |
| top-5 | 5 | 1024 | on | self-info | 0 | – | full-mask | 29.19 |
| top-6 | 6 | 1024 | on | self-info | 0 | – | full-mask | 28.66 |
| top-8 | 8 | 1024 | on | self-info | 0 | – | full-mask | 26.29 |
| **Ablation on chunk size (top-$K=4$)** | | | | | | | | |
| chunk 512 | 4 | 512 | on | self-info | 0 | – | full-mask | 26.62 |
| chunk 1024 | 4 | 1024 | on | self-info | 0 | – | full-mask | 29.34 |
| chunk 2048 | 4 | 2048 | on | self-info | 0 | – | full-mask | 31.51 |
| chunk 4096 | 4 | 4096 | on | self-info | 0 | – | full-mask | 27.34 |
| **Ablation on chunk BOS** | | | | | | | | |
| chunk BOS on | 4 | 1024 | on | self-info | 0 | – | full-mask | 29.34 |
| chunk BOS off | 4 | 1024 | off | self-info | 0 | – | full-mask | 28.06 |
| **Ablation on chunk score** | | | | | | | | |
| query-only | 2 | 1024 | on | self-info | 0 | – | full-mask | 34.84 |
| pseudo-1 | 2 | 1024 | on | pseudo-label | 1 | 1 | full-mask | 37.22 |
| pseudo-2 | 2 | 1024 | on | pseudo-label | 2 | 1 | full-mask | 37.17 |
| pseudo-4 | 2 | 1024 | on | pseudo-label | 4 | 1 | full-mask | 36.17 |
| pseudo-8 | 2 | 1024 | on | pseudo-label | 8 | 1 | full-mask | 36.39 |
| **Ablation on partial rounds (2 pseudo-labels)** | | | | | | | | |
| round 1 | 2 | 1024 | on | pseudo-label | 2 | 1 | full-mask | 37.17 |
| round 2 | 2 | 1024 | on | pseudo-label | 2 | 2 | full-mask | 34.75 |
| **Ablation on partial rounds (4 pseudo-labels)** | | | | | | | | |
| round 1 | 2 | 1024 | on | pseudo-label | 4 | 1 | full-mask | 36.17 |
| round 2 | 2 | 1024 | on | pseudo-label | 4 | 2 | full-mask | 33.53 |
| round 3 | 2 | 1024 | on | pseudo-label | 4 | 3 | full-mask | 31.53 |
| round 4 | 2 | 1024 | on | pseudo-label | 4 | 4 | full-mask | 33.07 |
| **Ablation on partial rounds (8 pseudo-labels)** | | | | | | | | |
| round 1 | 2 | 1024 | on | pseudo-label | 8 | 1 | full-mask | 36.39 |
| round 2 | 2 | 1024 | on | pseudo-label | 8 | 2 | full-mask | 32.35 |

Table 6: UltraLLaDA MF\-en ablations with intra\-chunk sparsity on the o46 subset \($n=46$\)\. Each block changes one design variable, named in the block header row\.

| Variant | top-$K$ | $C$ | chunk score | # pseudo-labels | rounds | KV build | top-$B$ | token score | MF-en o46 |
|---|---|---|---|---|---|---|---|---|---|
| **Ablation on retained-token budget** | | | | | | | | | |
| cap 256 | 8 | 1024 | self-info | 0 | – | full-mask | 256 | bidirectional | 26.01 |
| cap 512 | 8 | 1024 | self-info | 0 | – | full-mask | 512 | bidirectional | 28.75 |
| **Ablation on top-$K$ with intra-chunk sparsity** | | | | | | | | | |
| top-4 | 4 | 1024 | self-info | 0 | – | full-mask | 512 | bidirectional | 27.71 |
| top-6 | 6 | 1024 | self-info | 0 | – | full-mask | 512 | bidirectional | 27.62 |
| top-8 | 8 | 1024 | self-info | 0 | – | full-mask | 512 | bidirectional | 28.75 |

## Appendix D Prefilling Efficiency Analysis

![Refer to caption](https://arxiv.org/html/2606.10537v1/x11.png)

Figure 10: Latency breakdown of Prefilling\-dLLM into prefilling \(chunk scoring \+ cache build\) and decoding \(diffusion generation\)\. Decoding time remains constant \($\sim$0\.86s\) regardless of context length, while prefilling scales with input size\.

To isolate the contribution of each phase, we separately measure the latency of *prefilling* \(chunk scoring \+ KV cache construction\) and *decoding* \(diffusion generation\) within Prefilling\-dLLM\. As shown in Figure [10](https://arxiv.org/html/2606.10537#A4.F10), the decoding latency remains nearly constant \($\sim$0\.86s\) across all context lengths, since it always operates on the fixed compressed context \($\sim$4K tokens\)\. The prefilling cost grows with input length \(1\.66s at 8K, 2\.49s at 16K, 4\.14s at 32K\) as chunk scoring must attend over the full context\. Nevertheless, prefilling is a one\-time cost amortized over the entire generation, and the constant decoding time explains why Prefilling\-dLLM’s speedup advantage widens at longer contexts\.
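As a quick arithmetic check on the breakdown above, the following sketch combines the reported numbers. It assumes that end-to-end latency is simply the sum of the two measured phases, and the names `PREFILL_S`, `DECODE_S`, and `breakdown` are introduced here for illustration only.

```python
from typing import Dict, Tuple

# Latencies in seconds, as quoted in the text above.
PREFILL_S: Dict[str, float] = {"8K": 1.66, "16K": 2.49, "32K": 4.14}
DECODE_S: float = 0.86  # approximately constant across context lengths


def breakdown(prefill_s: float, decode_s: float) -> Tuple[float, float]:
    """Return (total latency, prefill share), assuming total = prefill + decode."""
    total = prefill_s + decode_s
    return total, prefill_s / total


for context, prefill in PREFILL_S.items():
    total, share = breakdown(prefill, DECODE_S)
    print(f"{context}: total ~ {total:.2f}s, prefill share ~ {share:.0%}")

# Growth of the prefill cost each time the context length doubles.
growth_8k_to_16k = PREFILL_S["16K"] / PREFILL_S["8K"]
growth_16k_to_32k = PREFILL_S["32K"] / PREFILL_S["16K"]
print(f"prefill growth: {growth_8k_to_16k:.2f}x, {growth_16k_to_32k:.2f}x")
```

Under this assumption the total is about 2.52s, 3.35s, and 5.00s at 8K, 16K, and 32K, and the prefill share rises from about 66% to about 83%. Each doubling of the context raises the prefill cost by about 1.5x and 1.66x respectively, which is less than 2x in both cases.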