Dynamic Chunking for Diffusion Language Models

Summary

This paper introduces Dynamic Chunking for Diffusion Language Models (DCDM), which replaces fixed positional blocks in block discrete diffusion with content-defined semantic chunks using a differentiable Chunking Attention mechanism, achieving consistent improvements across scales up to 1.5B parameters.

arXiv:2605.15676v1 Announce Type: new Abstract: Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.

Cached at: 05/18/26, 06:33 AM

# Dynamic Chunking for Diffusion Language Models
Source: [https://arxiv.org/html/2605.15676](https://arxiv.org/html/2605.15676)
- Yichen Zhu, CSE, HKUST, yc_zhu@zju.edu.cn
- Xiaoming Shi†‡, Xiaohongshu Inc., sxm728@hotmail.com
- Peng Zhao, Alibaba group, zhuyun.zp@alibaba-inc.com
- Weiyu Chen, CityUHK, weiyu.chen@cityu.edu.hk
- Debing Zhang, Xiaohongshu Inc., dengyang@xiaohongshu.com
- James Kwok, CSE, HKUST, jamesk@cse.ust.hk

###### Abstract

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.

## 1 Introduction

†Project leader. ‡Corresponding author: sxm728@hotmail.com

Diffusion large language models (dLLMs) have recently emerged as a competitive paradigm for text generation, due to their ability to decode multiple tokens in parallel. Open-source masked diffusion language models (MDLMs) \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\] such as LLaDA \[[30](https://arxiv.org/html/2605.15676#bib.bib13)\] and Dream \[[44](https://arxiv.org/html/2605.15676#bib.bib14)\] have achieved performance comparable to autoregressive models at similar scales, while proprietary models such as Gemini Diffusion \[[16](https://arxiv.org/html/2605.15676#bib.bib17)\] and Mercury \[[22](https://arxiv.org/html/2605.15676#bib.bib18)\] demonstrate substantially higher generation throughput. A key ingredient behind this progress is block diffusion \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\], which has become the dominant design for scalable diffusion-based language modeling.

Block diffusion combines the strengths of autoregressive and diffusion models. It factorizes a sequence autoregressively over blocks, preserving causal conditioning across groups of tokens, while denoising all tokens inside each block bidirectionally and in parallel. This design provides a practical compromise between the quality of autoregressive modeling and the efficiency of parallel diffusion sampling. However, existing block diffusion language models (BDLMs) \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\] define blocks by a fixed positional rule: a sequence is partitioned into contiguous segments of equal length. This choice imposes a strong structural prior that is independent of the content of the sequence.

We argue that fixed positional blocks are a limiting abstraction for language. The dependencies that determine a token are often not aligned with local contiguity: an entity may govern distant mentions, a mathematical derivation may depend on earlier premises, and a code token may be constrained by scope or syntax several lines away. Positional partitioning can therefore separate tokens that should be denoised jointly, while placing weakly related neighboring tokens in the same diffusion process. In this case, the model inherits the block-autoregressive factorization but applies it at a granularity not matched to the semantic structure of the sequence.

To address this mismatch, we propose the *Dynamic Chunking Diffusion Model* (DCDM), which replaces fixed positional blocks with learned semantic chunks. Rather than segmenting a sequence by position, DCDM clusters tokens according to representations produced inside the model. The resulting chunks may be non-contiguous, variable in size, and sequence-dependent. They serve the same role as blocks in block diffusion: tokens within a chunk are denoised bidirectionally in parallel, while chunks are ordered autoregressively through a chunk-causal mask. Thus, DCDM preserves the computational structure of block diffusion while making the unit of parallel denoising content-adaptive.

The core component of DCDM is *Chunking Attention*, a differentiable routing layer that assigns tokens to one of $K$ chunks. Direct point-centroid clustering is unstable in high-dimensional language-model representations, as a few clusters can dominate early and starve the remaining ones of gradient signal. We instead represent each chunk by a learnable low-dimensional subspace and route tokens according to subspace alignment \[[32](https://arxiv.org/html/2605.15676#bib.bib7)\]. A soft attention path places the chunking geometry directly on the gradient path of the diffusion objective, while the induced hard assignments define a semantic chunk-causal attention mask. This construction generalizes positional block diffusion, which is recovered when the learned chunks coincide with fixed contiguous blocks.

The contributions of this paper are:

- We introduce Chunking Attention, a subspace-based differentiable routing mechanism that induces semantic chunk-causal masks for diffusion language modeling.
- We develop DCDM, a diffusion language model that strictly generalizes BDLM and trains the chunking mechanism and denoiser end-to-end under the diffusion objective.
- We provide extensive empirical evidence that semantic chunking outperforms its positional counterpart on downstream tasks across general reasoning, mathematics, and code generation at both 0.5B and 1.5B scales, validating that block diffusion benefits substantially from content-adaptive granularity.

## 2 Related Work

#### Diffusion Large Language Models.

The landscape of language generation has long been dominated by autoregressive models \[[33](https://arxiv.org/html/2605.15676#bib.bib24), [7](https://arxiv.org/html/2605.15676#bib.bib26), [1](https://arxiv.org/html/2605.15676#bib.bib25), [4](https://arxiv.org/html/2605.15676#bib.bib28), [39](https://arxiv.org/html/2605.15676#bib.bib27), [25](https://arxiv.org/html/2605.15676#bib.bib30), [12](https://arxiv.org/html/2605.15676#bib.bib29)\]. While celebrated for their high-quality outputs, these models are fundamentally constrained by a sequential, token-by-token decoding process \[[19](https://arxiv.org/html/2605.15676#bib.bib32), [38](https://arxiv.org/html/2605.15676#bib.bib33)\]. To alleviate these latency bottlenecks, dLLMs, a class of diffusion-based frameworks specifically designed for the discrete data domain, have emerged as a compelling alternative. By incorporating an absorbing state, e.g., \[MASK\], to represent noise, Austin et al. \[[3](https://arxiv.org/html/2605.15676#bib.bib2)\] laid the foundation for masked diffusion modeling. This framework has been subsequently extended by a series of recent works \[[34](https://arxiv.org/html/2605.15676#bib.bib3), [37](https://arxiv.org/html/2605.15676#bib.bib12), [30](https://arxiv.org/html/2605.15676#bib.bib13), [44](https://arxiv.org/html/2605.15676#bib.bib14), [15](https://arxiv.org/html/2605.15676#bib.bib15)\]. Notably, MDLM \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\] is among the most widely adopted, offering a simple yet highly efficient training objective. The LLaDA \[[30](https://arxiv.org/html/2605.15676#bib.bib13)\] series scales diffusion language models beyond 8 billion parameters, demonstrating performance comparable to, if not exceeding, that of autoregressive models of equivalent scale.

#### Autoregression-diffusion Hybrid Language Models.

Recent works have explored integrating the computational efficiency of autoregressive models into diffusion-based frameworks, particularly for complex tasks such as video synthesis. A representative approach, BDLM \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\], models semantic dependencies across blocks autoregressively while performing the denoising process independently within each block. Fast-dLLMs \[[42](https://arxiv.org/html/2605.15676#bib.bib16)\] employ techniques such as block-wise prefix caching to achieve generation efficiency that substantially surpasses that of AR models, without compromising generation quality. Another line of work attempts to relax the rigidity of fixed positional blocks at *inference time*. AdaBlock-dLLM \[[26](https://arxiv.org/html/2605.15676#bib.bib22)\] adaptively adjusts block boundaries during sampling using local denoising-confidence signals.

## 3 Preliminary

### 3.1 Masked Diffusion Models

Masked Diffusion Language Models (MDLMs) \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\] are a class of discrete diffusion models in which the absorbing distribution $\bm{\pi}$ of the forward process is the point mass on a special mask token $\mathbf{m}$. Let $\mathbf{x}\in\mathcal{V}^{L}$ denote a clean sequence of length $L$ drawn from the data distribution, and let $\mathbf{z}_{t}$ denote its corrupted latent variable at timestep $t\in[0,1]$. The forward process operates over continuous time $t\in[0,1]$ and replaces each token independently by $\mathbf{m}$ with probability $1-\alpha_{t}$, where $\alpha_{t}$ is a predefined noise schedule strictly decreasing from $\alpha_{0}=1$ (clean data) to $\alpha_{1}=0$ (fully masked). The reverse process is parameterized by a denoiser $\mathbf{x}_{\theta}(\mathbf{z}_{t},t)$ trained to predict clean data from the masked state, and its evidence lower bound (ELBO) simplifies to the per-sample weighted cross-entropy loss

$$
\mathcal{L}(\mathbf{x},\theta)\;=\;\mathbb{E}_{q(\mathbf{z}_{t}\mid\mathbf{x})}\int_{0}^{1}\frac{\alpha_{t}^{\prime}}{1-\alpha_{t}}\sum_{\ell:\,\mathbf{z}_{t}^{\ell}=\mathbf{m}}\log\bigl\langle\mathbf{x}_{\theta,\ell}(\mathbf{z}_{t}^{1:L},t),\;\mathbf{x}_{\ell}\bigr\rangle\,\mathrm{d}t,
\tag{1}
$$

where $\alpha_{t}^{\prime}=\mathrm{d}\alpha_{t}/\mathrm{d}t$ and $\mathbf{x}_{\theta,\ell}$ is the predicted categorical distribution at position $\ell$.
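As a rough illustration of the forward process and the loss weight in Eq. (1), the sketch below assumes a linear schedule $\alpha_t = 1 - t$, which the text does not specify. The names `MASK`, `alpha`, `mask_tokens`, and `loss_weight` are hypothetical and chosen only for this example.

```python
import random

MASK = "[MASK]"  # hypothetical stand-in for the mask token m


def alpha(t: float) -> float:
    """Assumed linear noise schedule: alpha_0 = 1 (clean), alpha_1 = 0 (fully masked)."""
    return 1.0 - t


def mask_tokens(x: list[str], t: float, rng: random.Random) -> list[str]:
    """Replace each token independently by MASK with probability 1 - alpha_t."""
    p_mask = 1.0 - alpha(t)
    return [MASK if rng.random() < p_mask else tok for tok in x]


def loss_weight(t: float) -> float:
    """Weight alpha'_t / (1 - alpha_t) from Eq. (1); equals -1/t for the linear schedule."""
    d_alpha_dt = -1.0  # derivative of the assumed schedule 1 - t
    return d_alpha_dt / (1.0 - alpha(t))


rng = random.Random(0)
x = ["the", "cat", "sat", "on", "the", "mat"]
z_half = mask_tokens(x, 0.5, rng)  # each token is masked with probability 0.5
z_full = mask_tokens(x, 1.0, rng)  # every token is masked at t = 1
```

Under this assumed schedule the weight is negative and the log-probabilities are non-positive, so each term of the integrand is non-negative; smaller $t$ (fewer masked tokens) receives a larger per-token weight.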

### 3.2 Block Diffusion Models

Block Diffusion Language Models (BDLMs) \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\] combine autoregressive and diffusion modelling by partitioning a sequence of $L$ tokens into $K$ contiguous blocks of fixed length $B$ (with $L=K\cdot B$), performing discrete diffusion within each block while maintaining autoregressive dependencies across blocks. The likelihood factorizes autoregressively over blocks,

$$
\log p_{\theta}(\mathbf{x})\;=\;\sum_{b=1}^{K}\log p_{\theta}\!\left(\mathbf{x}^{b}\,\big|\,\mathbf{x}^{<b}\right),
\tag{2}
$$

where $\mathbf{x}^{b}$ denotes the $b$-th block of $B$ tokens and $\mathbf{x}^{<b}$ denotes all preceding blocks. Each conditional $p_{\theta}(\mathbf{x}^{b}\mid\mathbf{x}^{<b})$ is modeled by a discrete diffusion process over the $B$ tokens of that block. Applying the masked diffusion ELBO to each block yields the per-sample training bound

$$
\log p_{\theta}(\mathbf{x})\;\geq\;\mathcal{L}_{\text{BD}}(\mathbf{x},\theta)\;:=\;\sum_{b=1}^{K}\mathcal{L}\!\left(\mathbf{x}^{b},\mathbf{x}^{<b},\theta\right).
\tag{3}
$$

By tuning the block size $B$, BDLMs interpolate between pure diffusion ($B=L$, $K=1$) and autoregressive models ($B=1$, $K=L$). This design supports flexible-length generation, KV caching across blocks, and parallel sampling within each block, while achieving state-of-the-art perplexity among discrete diffusion models.

![Refer to caption](https://arxiv.org/html/2605.15676v1/x1.png)

Figure 1: Overview of DCDM. Left: The denoiser stacks $N-1$ DiT blocks on top of a single Chunking Attention layer that produces the content-defined partition consumed by all downstream blocks. Right: Operationally, the chunking attention takes the noisy input together with a noise mask and emits a per-token cluster assignment (color-coded as Green, Orange, Yellow), which induces the chunk-causal attention mask used by every subsequent layer of the denoiser.

## 4 Methodology

In this section, we begin by introducing Chunking Attention in Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1). Building upon this foundational module, Section [4.2](https://arxiv.org/html/2605.15676#S4.SS2) presents the Dynamic Chunking Diffusion Model (DCDM) and its training paradigm. Finally, Section [4.3](https://arxiv.org/html/2605.15676#S4.SS3) describes the load-balancing mechanisms that keep the learned partition well-conditioned throughout training.

### 4.1 Chunking Attention

We add a new chunking attention layer that performs token-level clustering of hidden states $\mathbf{H}\in\mathbb{R}^{L\times d}$ into $K$ groups, replacing the positional partition of BDLMs \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\] with one produced inside the model itself. Existing block diffusion partitions a sequence into fixed-size positional blocks and applies autoregressive conditioning across them, but the positional partition is rigid and cannot adapt to the semantic content of the sequence, so semantically related tokens routinely end up split across block boundaries while unrelated tokens are denoised in parallel.

A natural realization of such a partition is to perform clustering inside an attention layer itself: learn a single point centroid $\bm{c}_{k}\in\mathbb{R}^{d}$ per cluster and route each token by its inner-product similarity to $\bm{c}_{k}$, an idea explored in prior work on attention-based clustering \[[28](https://arxiv.org/html/2605.15676#bib.bib1)\]. While this recipe has been demonstrated on low-dimensional problems, we find it unstable in the high-dimensional embedding spaces of modern language models (see Appendix [E.1](https://arxiv.org/html/2605.15676#A5.SS1)). We therefore promote each cluster from a single point to a low-dimensional subspace, adopting the following design principle: *each cluster is characterized by a low-dimensional subspace of $\mathbb{R}^{d}$, and tokens are assigned to clusters according to their alignment with these subspaces.*

We parameterize the layer with $K$ learnable matrices $\{\bm{\mu}_{k}\}_{k=1}^{K}$, where each $\bm{\mu}_{k}\in\mathbb{R}^{d\times h}$ serves as the basis of one cluster. Algebraically, $\bm{\mu}_{k}$ acts as a projection $\mathbb{R}^{d}\mapsto\mathbb{R}^{h}$ that maps a token $\bm{x}_{\ell}$ to

$$
\bm{p}_{k,\ell}\;=\;\bm{\mu}_{k}^{\top}\,\bm{x}_{\ell}\;\in\;\mathbb{R}^{h}.
\tag{4}
$$

Geometrically, the column span $\mathcal{S}_{k}:=\mathrm{col}(\bm{\mu}_{k})\subset\mathbb{R}^{d}$ is an $h$-dimensional subspace associated with the $k$-th cluster, in the spirit of subspace clustering \[[32](https://arxiv.org/html/2605.15676#bib.bib7)\], a line of work with a long history of success in prior literature \[[21](https://arxiv.org/html/2605.15676#bib.bib8), [43](https://arxiv.org/html/2605.15676#bib.bib9)\]. Under this view, $\bm{p}_{k,\ell}$ records the coordinates of $\bm{x}_{\ell}$ along the $h$ axes spanning $\mathcal{S}_{k}$, and $\|\bm{p}_{k,\ell}\|$ measures the alignment of $\bm{x}_{\ell}$ with $\mathcal{S}_{k}$. Tokens are free to vary along these $h$ directions without incurring a distance penalty, so intra-cluster variability is absorbed by the geometry of the subspace itself.

To isolate intra-cluster interactions among tokens, the module computes, for each cluster $k$, a pairwise affinity matrix $\mathbf{A}_{k}\in\mathbb{R}^{L\times L}$ whose entries are inner products of the projected tokens:

$$
[\mathbf{A}_{k}]_{\ell,m}\;=\;\frac{1}{\sqrt{h}}\,\bm{p}_{k,\ell}^{\top}\,\bm{p}_{k,m}\;=\;\frac{1}{\sqrt{h}}\,\bm{x}_{\ell}^{\top}\,\bm{\mu}_{k}\,\bm{\mu}_{k}^{\top}\,\bm{x}_{m}.
\tag{5}
$$

The factor $1/\sqrt{h}$ is the standard dot-product scaling \[[40](https://arxiv.org/html/2605.15676#bib.bib21)\], applied at the projected dimension $h$ rather than $d$ since the inner product is taken in $\mathbb{R}^{h}$. Sharing $\bm{\mu}_{k}$ across the query and key sides turns the affinity into a bilinear gate: $[\mathbf{A}_{k}]_{\ell,m}$ activates only when $\bm{x}_{\ell}$ and $\bm{x}_{m}$ are simultaneously aligned with $\mathcal{S}_{k}$, so pairs straddling different clusters are suppressed.

The $K$ per-cluster softmax-ed attention matrices $\{\mathbf{T}_{k}=\mathrm{softmax}(\mathbf{A}_{k})\}$ are aggregated and applied to the hidden states to produce the module output $\mathbf{Y}$:

$$
\mathbf{Y}\;=\;\mathbf{W}_{O}\left(\frac{1}{\sqrt{K}}\sum_{k=1}^{K}\mathbf{T}_{k}\right)\mathbf{W}_{V}\,\mathbf{H},
\tag{6}
$$

where $\mathbf{W}_{V},\mathbf{W}_{O}\in\mathbb{R}^{d\times d}$ are learnable value/output projections independent of the clustering geometry, and $1/\sqrt{K}$ stabilizes the variance of the combined operator. The purpose of this aggregation is not merely to mix tokens, but to fold the clustering operation into the main computational path of the model: by routing the hidden states through $\sum_{k}\mathbf{T}_{k}$, the learnable matrices $\{\bm{\mu}_{k}\}$ enter the gradient flow of the diffusion objective, and the cluster structure is learned end-to-end together with the denoiser rather than as a detached auxiliary module.

The downstream layers operate on hard cluster identities: they consume an attention mask built from $c_{\ell}\in\{1,\dots,K\}$ that restricts attention to within-cluster tokens. We score each token-subspace pair by

$$
r_{\ell,k}\;=\;\lVert\bm{p}_{k,\ell}\rVert\;=\;\lVert\bm{\mu}_{k}^{\top}\,\bm{x}_{\ell}\rVert,\qquad c_{\ell}\;=\;\arg\max_{k}r_{\ell,k},
\tag{7}
$$

which reuses the bilinear quantity that drives the soft aggregation. The diagonal of $\mathbf{A}_{k}$ already contains $r_{\ell,k}^{2}/\sqrt{h}$, so hard routing and soft aggregation share a single geometry $\{\bm{\mu}_{k}\}$. We treat $c_{\ell}$ as a non-differentiable index. Gradients to $\{\bm{\mu}_{k}\}$ flow exclusively through the soft path $\sum_{k}\mathbf{T}_{k}$ in Eq. ([6](https://arxiv.org/html/2605.15676#S4.E6)), which in turn shapes the geometry the hard router reads off. The naive $\arg\max$ rule is prone to load imbalance across clusters, and we defer mitigations to Section [4.3](https://arxiv.org/html/2605.15676#S4.SS3).
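The projection, affinity, and hard-routing rules of Eqs. (4), (5), and (7) can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, clusters are 0-indexed rather than 1-indexed, and the bases are hand-picked coordinate axes instead of learned matrices.

```python
import math


def project(mu_k: list[list[float]], x: list[float]) -> list[float]:
    """Compute p = mu_k^T x for a basis mu_k of shape (d, h) and a token x of length d."""
    d, h = len(mu_k), len(mu_k[0])
    return [sum(mu_k[i][j] * x[i] for i in range(d)) for j in range(h)]


def routing_scores(mus, x):
    """r_k = ||mu_k^T x|| for every cluster k (Eq. 7)."""
    return [math.sqrt(sum(v * v for v in project(mu_k, x))) for mu_k in mus]


def hard_assign(mus, tokens):
    """c_l = argmax_k r_{l,k}; ties go to the lowest cluster index."""
    out = []
    for x in tokens:
        r = routing_scores(mus, x)
        out.append(max(range(len(r)), key=r.__getitem__))
    return out


def affinity(mu_k, x_l, x_m):
    """[A_k]_{l,m} = p_l^T p_m / sqrt(h) (Eq. 5)."""
    p_l, p_m = project(mu_k, x_l), project(mu_k, x_m)
    return sum(a * b for a, b in zip(p_l, p_m)) / math.sqrt(len(p_l))


# Toy setup (d = 2, h = 1, K = 2): each subspace is one coordinate axis.
mus = [[[1.0], [0.0]], [[0.0], [1.0]]]
tokens = [[3.0, 1.0], [0.5, 2.0], [-4.0, 0.0]]
assignments = hard_assign(mus, tokens)  # first and third tokens align with axis 0
```

In this toy case the diagonal entry `affinity(mus[0], tokens[0], tokens[0])` equals the squared routing score divided by $\sqrt{h}$, which is the shared-geometry property the paragraph above describes.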

Overall, the module adds $Kdh+2d^{2}$ parameters: the $K$ basis matrices together with the value/output projections. Hence, introducing chunking attention into a transformer incurs no significant parameter overhead.
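To make the parameter count concrete, the following snippet evaluates $Kdh+2d^{2}$ for one set of sizes. The values of `d`, `h`, and `K` are hypothetical and chosen only for illustration; the text above does not state the sizes used in the experiments.

```python
def chunking_attention_params(d: int, h: int, K: int) -> int:
    """K basis matrices of shape (d, h) plus value and output projections of shape (d, d)."""
    return K * d * h + 2 * d * d


# Hypothetical sizes, not taken from the paper.
extra = chunking_attention_params(d=1024, h=64, K=8)  # 524288 + 2097152 = 2621440
```

For these illustrative sizes the $2d^{2}$ projection term dominates, and the total is comparable to a small fraction of one standard attention layer of the same width.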

### 4.2 Dynamic Chunking Diffusion Language Model

The forward pass of DCDM proceeds in two stages, illustrated in Figure [1](https://arxiv.org/html/2605.15676#S3.F1): (i) a chunking stage, where the input embedding is passed through a single DiT block and read by the chunking attention layer (Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1)) to emit a per-token cluster identifier $c_{\ell}\in\{1,\dots,K\}$; and (ii) a denoising stage, where the remaining DiT blocks denoise the sequence under a chunk-causal mask induced by these identifiers.

To formalize this chunking mechanism, the cluster identifiers produced in the chunking stage serve directly as the chunk indices for each position $\ell$. The induced semantic chunks are defined as:

$$
\mathcal{B}_{k}\;=\;\bigl\{\,\ell\in\{1,\dots,L\}\;:\;c_{\ell}=k\,\bigr\},\qquad k=1,\dots,K.
\tag{8}
$$

These chunks play the same role as the fixed positional blocks in BDLMs \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\]. However, rather than being contiguous or of uniform size, their compositions are dynamically determined per sequence by the learned centroid geometry.

Based on these semantic chunks, the denoising stage factorizes the sequence likelihood autoregressively. Writing $\mathbf{x}^{(k)}=(x_{\ell})_{\ell\in\mathcal{B}_{k}}$ for the tokens of the $k$-th chunk, we have:

$$
p_{\theta}(\mathbf{x})\;=\;\prod_{k=1}^{K}p_{\theta}\!\left(\mathbf{x}^{(k)}\mid\mathbf{x}^{(<k)}\right),\qquad\mathbf{x}^{(<k)}:=\bigcup_{j<k}\mathbf{x}^{(j)},
\tag{9}
$$

mirroring Eq. ([2](https://arxiv.org/html/2605.15676#S3.E2)) of BDLM but with content-defined chunks (the derivation and proof of the corresponding DCDM NELBO objective are provided in Appendix [C](https://arxiv.org/html/2605.15676#A3)). Conceptually, each conditional is modeled as a discrete denoising diffusion process over the tokens of $\mathcal{B}_{k}$, governed by a base logical attention mask $\mathbf{M}^{\text{chunk}}\in\{0,1\}^{L\times L}$, defined as $\mathbf{M}^{\text{chunk}}_{\ell,m}=\mathbb{I}\!\left[\,c_{m}\leq c_{\ell}\,\right]$. This logical mask outlines the sequence-level topology: bidirectional attention *within* a chunk and one-way attention *across* chunks.
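The chunk-causal mask defined above can be built directly from the per-token chunk indices. The sketch below is a minimal illustration with a hypothetical helper name and 0-indexed chunks; it also shows the special case in which contiguous equal-size chunks reproduce the block-causal mask of positional block diffusion.

```python
def chunk_causal_mask(c: list[int]) -> list[list[int]]:
    """M[l][m] = 1 if c[m] <= c[l], i.e. query position l may attend to key m."""
    return [[1 if c_m <= c_l else 0 for c_m in c] for c_l in c]


# Non-contiguous semantic chunks: positions 0 and 2 share chunk 0.
semantic = chunk_causal_mask([0, 1, 0, 2])

# Contiguous equal-size chunks recover the block-causal mask of block diffusion (B = 2).
positional = chunk_causal_mask([0, 0, 1, 1])
```

In `semantic`, positions 0 and 2 attend to each other even though position 1 sits between them, while the single token in chunk 2 attends to the whole sequence.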

To facilitate this conditioned denoising during training, we follow the dual-stream design of BDLMs and feed DCDM a concatenated sequence $\mathbf{z}_{t}\oplus\mathbf{x}$. Specifically, $\mathbf{x}^{(k)}$ is corrupted by the forward process $q(\mathbf{z}^{(k)}_{t}\mid\mathbf{x}^{(k)})$ at a sampled timestep $t\in(0,1]$ while $\mathbf{x}^{(<k)}$ remains clean. However, applying this dual-stream architecture to our chunking framework introduces a subtle risk of information leakage. If we naively allow $\mathbf{x}$ to perform full self-attention during the chunking stage, the hidden state at each position of $\mathbf{x}$ would aggregate information from the entire clean sequence, including the very positions the denoiser must later predict. Consequently, when $\mathbf{z}_{t}$ reads these $\mathbf{x}$-side hidden states for cross-chunk teacher-forcing, it would gain indirect access to its own ground-truth.

To resolve this leakage, we introduce a noise mask $\mathbf{M}^{\text{noise}}$ that mirrors the noise pattern of $\mathbf{z}_{t}$ onto $\mathbf{x}$, isolating the problematic positions. Let $\nu_{\ell}\in\{0,1\}$ indicate whether position $\ell$ in $\mathbf{z}_{t}$ is currently masked; the same $\nu_{\ell}$ is applied to the corresponding position in the $\mathbf{x}$ half. The mask is defined as:

$$
\mathbf{M}^{\text{noise}}_{\ell,m}\;=\;\mathbb{I}[\nu_{m}=0]\;\lor\;\mathbb{I}[\nu_{\ell}=1\,\land\,\ell=m].
\tag{10}
$$

Under this noise mask, positions with a clean $\mathbf{z}_{t}$ counterpart ($\nu_{\ell}=0$) attend to all clean keys. Conversely, positions with a masked counterpart ($\nu_{\ell}=1$) can attend to all clean keys *plus* themselves ($\ell=m$), but they are restricted from attending to any other masked keys. Crucially, this ensures that an $\mathbf{x}$ position corresponding to a masked token does not aggregate information from the ground-truth values of other masked positions. As a result, when the denoiser later reads $\mathbf{x}$ for teacher forcing, no backdoor path exists for the masked tokens' true values to travel back and influence the prediction.
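Eq. (10) translates into a few lines of code. The following sketch uses a hypothetical helper name and a hand-picked noise pattern to show which keys each query position may attend to; it covers only the $L\times L$ noise mask, not the full $2L\times 2L$ dual-stream mask.

```python
def noise_mask(nu: list[int]) -> list[list[int]]:
    """M[l][m] = 1 if key m is clean, or if query l is masked and m == l (Eq. 10)."""
    L = len(nu)
    return [
        [1 if nu[m] == 0 or (nu[l] == 1 and l == m) else 0 for m in range(L)]
        for l in range(L)
    ]


# nu[l] = 1 means position l is masked in z_t; positions 1 and 2 are masked here.
mask = noise_mask([0, 1, 1, 0])
```

Rows 1 and 2 (the masked positions) each see the two clean keys plus themselves, but never each other, which is the isolation property described above.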

Similarly, to accommodate this concatenated input during training, the base logical mask $\mathbf{M}^{\text{chunk}}$ must be carefully expanded. A detailed construction of the exact $2L\times 2L$ joint attention mask required for this dual-stream architecture is provided in Appendix [B](https://arxiv.org/html/2605.15676#A2).

### 4.3 Load Balancing

Without explicit intervention, the hard routing of Eq. ([7](https://arxiv.org/html/2605.15676#S4.E7)) can collapse into severely unbalanced configurations, a failure mode extensively documented in the mixture-of-experts literature \[[36](https://arxiv.org/html/2605.15676#bib.bib19), [13](https://arxiv.org/html/2605.15676#bib.bib20)\]. Once a centroid loses early competition, the soft path provides it with vanishingly small gradient signal, $\bm{\mu}_{k}$ stagnates, and the cluster never recovers. At a longer timescale, even when individual sequences look well-distributed, residual imbalance accumulated over optimizer steps drives the global load away from uniform usage. We address these two failure modes with complementary mechanisms at two timescales: a *per-sequence* auxiliary loss against intra-sequence starvation, and, following Wang et al. \[[41](https://arxiv.org/html/2605.15676#bib.bib11)\], a *global-batch* bias correction that stabilizes load over the batch and across training steps.

#### Per-sequence load balancing.

Within a single sequence, we encourage every centroid to receive a non-trivial share of the tokens through an auxiliary loss applied to a differentiable hard sample of the routing scores. For each token, we draw

$$
\tilde{\mathbf{c}}_{\ell}\;=\;\mathrm{GumbelSoftmax}_{\mathrm{ST}}(\bm{r}_{\ell})\in\{0,1\}^{K},
\tag{11}
$$

a one-hot vector under the straight-through estimator \[[20](https://arxiv.org/html/2605.15676#bib.bib10)\], whose forward value is a hard sample but whose backward path follows the softmax surrogate and thus carries gradients into $\{\bm{\mu}_{k}\}$. With per-sequence usage frequency $f_{b,k}=\tfrac{1}{L}\sum_{\ell=1}^{L}[\tilde{\mathbf{c}}_{\ell}]_{k}$ for sequence $b$ in a batch of size $B$, we minimize

$$
\mathcal{L}_{\mathrm{chunk}}\;=\;-\frac{1}{BK}\sum_{b=1}^{B}\sum_{k=1}^{K}\log\bigl(f_{b,k}+\varepsilon\bigr),
\tag{12}
$$

where $\varepsilon$ is a small positive constant. Under the simplex constraint $\sum_{k}f_{b,k}=1$, the loss attains its minimum when every sequence distributes its tokens uniformly across the $K$ clusters. Because $\mathcal{L}_{\mathrm{chunk}}$ is computed independently per sequence, it acts on the fastest available timescale, a single forward pass, and prevents any cluster from being starved within an individual example.
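The forward value of Eq. (12) can be checked numerically. The sketch below computes the loss from already-sampled hard assignments and omits the Gumbel-Softmax sampling and the straight-through gradient entirely; the helper names and the value of `eps` are assumptions made for this example.

```python
import math


def usage_frequencies(assignments: list[int], K: int) -> list[float]:
    """f_k = fraction of the sequence's tokens routed to cluster k (0-indexed)."""
    L = len(assignments)
    return [sum(1 for c in assignments if c == k) / L for k in range(K)]


def chunk_balance_loss(batch: list[list[int]], K: int, eps: float = 1e-6) -> float:
    """L_chunk = -(1 / (B * K)) * sum_b sum_k log(f_{b,k} + eps)  (Eq. 12)."""
    total = 0.0
    for assignments in batch:
        for f in usage_frequencies(assignments, K):
            total += math.log(f + eps)
    return -total / (len(batch) * K)


balanced = chunk_balance_loss([[0, 1, 0, 1]], K=2)   # f = [0.5, 0.5]
collapsed = chunk_balance_loss([[0, 0, 0, 0]], K=2)  # f = [1.0, 0.0]
```

With $K=2$, the balanced sequence gives a loss close to $\log 2$, while the collapsed sequence is penalized heavily through the $\log\varepsilon$ term of the empty cluster.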

#### Global-batch load balancing.

Even when each sequence uses all $K$ clusters, the overall load can still drift over the batch and across training steps. Therefore, we add a non-trainable per-cluster bias $\mathbf{b}\in\mathbb{R}^{K}$ to the routing scores at the hard-assignment step only:

$$
c_{\ell}\;=\;\arg\max_{k}\,\bigl(r_{\ell,k}+b_{k}\bigr),
\tag{13}
$$

while the soft aggregation in Eq. ([6](https://arxiv.org/html/2605.15676#S4.E6)) continues to use the unbiased scores. We track a running token count $N_{k}=\sum_{\ell}\mathbb{I}[c_{\ell}=k]$ for each centroid (with total $N=\sum_{k}N_{k}$), and update the bias once per update interval by

$$
b_{k}\;\leftarrow\;b_{k}-\eta_{b}\,\bigl(N_{k}/N-1/K\bigr),
\tag{14}
$$

where $\eta_{b}>0$ is a step size controlling how aggressively the bias reacts to load imbalance. We use a fixed $\eta_{b}$ throughout training. The update lowers the bias of overloaded clusters and raises that of underused ones. Because $\mathbf{b}$ enters only the discrete $\arg\max$ branch, no gradient flows through it: the soft path that trains $\{\bm{\mu}_{k}\}$ is unaffected, and $\mathbf{b}$ acts purely as a control-loop correction layered on top of the resulting hard assignments.
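One step of the bias-corrected routing in Eqs. (13) and (14) can be sketched as below. The routing scores, the step size `eta_b`, and the helper names are made-up values for illustration, and a single update stands in for the running counts accumulated over an update interval.

```python
def biased_assign(scores: list[list[float]], bias: list[float]) -> list[int]:
    """c_l = argmax_k (r_{l,k} + b_k)  (Eq. 13); clusters are 0-indexed here."""
    K = len(bias)
    return [max(range(K), key=lambda k: r[k] + bias[k]) for r in scores]


def update_bias(bias: list[float], counts: list[int], eta_b: float) -> list[float]:
    """b_k <- b_k - eta_b * (N_k / N - 1 / K)  (Eq. 14)."""
    K, N = len(bias), sum(counts)
    return [b - eta_b * (n / N - 1.0 / K) for b, n in zip(bias, counts)]


# Toy routing scores for four tokens and K = 2 clusters; cluster 0 starts overloaded.
scores = [[1.0, 0.95], [1.0, 0.8], [1.0, 0.5], [0.2, 1.0]]
bias = [0.0, 0.0]
before = biased_assign(scores, bias)           # three tokens go to cluster 0
counts = [before.count(k) for k in range(2)]   # token counts N_k per cluster
bias = update_bias(bias, counts, eta_b=0.2)    # step size is a made-up value
after = biased_assign(scores, bias)            # the borderline token moves to cluster 1
```

Only the token whose two scores were nearly tied changes cluster after the update, which reflects the intended behavior: the bias nudges marginal assignments toward underused clusters without touching the scores used by the soft path.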

## 5 Experiments

### 5.1 Experimental Setup

![Refer to caption](https://arxiv.org/html/2605.15676v1/assets/images/steps_vs_loss.png)

Figure 2: Training loss against training steps for the three dense diffusion models. Final-step values confirm the ordering observed on downstream metrics.

#### Datasets.

All diffusion models are pretrained on OpenWebText \[[14](https://arxiv.org/html/2605.15676#bib.bib34)\] under a unified training protocol; full training details are deferred to Appendix [D](https://arxiv.org/html/2605.15676#A4). We then evaluate the pretrained LLMs on a suite of standard benchmarks grouped into three categories: general reasoning and knowledge (ARC-C \[[9](https://arxiv.org/html/2605.15676#bib.bib35)\], MMLU \[[18](https://arxiv.org/html/2605.15676#bib.bib36), [17](https://arxiv.org/html/2605.15676#bib.bib37)\], HellaSwag \[[45](https://arxiv.org/html/2605.15676#bib.bib38)\], TruthfulQA \[[24](https://arxiv.org/html/2605.15676#bib.bib41)\], WinoGrande \[[35](https://arxiv.org/html/2605.15676#bib.bib40)\], PIQA \[[6](https://arxiv.org/html/2605.15676#bib.bib39)\]), mathematical reasoning (MATH \[[23](https://arxiv.org/html/2605.15676#bib.bib42)\], GSM8K \[[10](https://arxiv.org/html/2605.15676#bib.bib43)\]), and code generation (HumanEval \[[5](https://arxiv.org/html/2605.15676#bib.bib44)\]). Training dynamics for the dense diffusion models are summarized in Figure [2](https://arxiv.org/html/2605.15676#S5.F2). A complementary zero-shot language-modeling evaluation on seven held-out corpora is reported in Appendix [E.2](https://arxiv.org/html/2605.15676#A5.SS2).

#### Baselines\.

We compare DCDM against the dominant paradigms for discrete diffusion language modeling. (i) MDLM \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\] is a masked discrete diffusion language model that denoises tokens in parallel without block structure. (ii) BDLM \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\] is a block discrete diffusion model that imposes an autoregressive factorization over fixed-size *positional* blocks, and is the closest prior work to ours. (iii) AdaBlock-dLLM \[[26](https://arxiv.org/html/2605.15676#bib.bib22)\] is a training-free inference-time scheduler that adapts block size during decoding on top of a model trained with fixed positional blocks. We evaluate all baselines at matched parameter scale, and additionally include a sparse DCDM (MoE) variant whose active-parameter budget (0.4B / 1.2B) is slightly below that of the dense models, to assess whether the semantic chunk structure of DCDM composes with conditional computation. All dense diffusion models are evaluated at two scales, 0.5B and 1.5B parameters, with identical training data and tokenizer, under our unified evaluation protocol.

#### Metrics\.

For the downstream benchmarks we follow each task's standard metric \[[15](https://arxiv.org/html/2605.15676#bib.bib15)\]: accuracy for the multiple-choice tasks (ARC-C, MMLU, HellaSwag, TruthfulQA, WinoGrande, PIQA), exact-match accuracy for MATH and GSM8K, and pass@1 for HumanEval. The number of in-context examples used for each benchmark is reported in parentheses in Table [1](https://arxiv.org/html/2605.15676#S5.T1); entries without a parenthesis are evaluated zero-shot.

Table 1: Downstream benchmark results for pretrained LLMs at the 0.5B and 1.5B scales. Numbers in parentheses indicate the number of in-context examples; entries without parentheses are zero-shot. $\dagger$ marks benchmarks for which the model has been fine-tuned on the corresponding training split. For DCDM (MoE), the Active Params row reports activated parameters per token with total parameters in parentheses; all other models activate all parameters at every token. Within each scale, the best result per row is shown in bold and the second-best is underlined. $\ast$ Average is computed over all nine reported benchmarks.

### 5\.2Main Results

#### Downstream benchmarks\.

Table [1](https://arxiv.org/html/2605.15676#S5.T1) reports zero- and few-shot results across the downstream suite at both 0.5B and 1.5B scales. Three patterns hold consistently across scales.

First, DCDM outperforms the unstructured diffusion baseline MDLM on every benchmark except WinoGrande, where the two are within 1 point of each other at both scales (DCDM is ahead by 0.84 at 0.5B but behind by 0.40 at 1.5B). The largest gains are concentrated on HellaSwag ($+9.59$ / $+8.23$ points at 0.5B / 1.5B), GSM8K ($+5.05$ / $+5.17$), and PIQA ($+1.74$ / $+4.84$); smaller but consistent gains of $0.3$–$2.0$ points appear on ARC-C, MMLU, TruthfulQA, and HumanEval. This indicates that the semantic chunk factorization introduced in Section [4.2](https://arxiv.org/html/2605.15676#S4.SS2) provides a clear inductive advantage over parallel denoising without block structure.

Second, DCDM also surpasses the positional\-block baseline BDLM on every benchmark except WinoGrande, where it lags by 2\.47 points at 0\.5B and 1\.26 points at 1\.5B\. Since BDLM shares the same block\-autoregressive factorization but uses fixed positional blocks, this gap is attributable to the content\-defined chunk partition of DCDM rather than to the block structure itself\.

Third, DCDM (MoE) yields a small but consistent improvement over dense DCDM, averaging $+0.57$ points at 0.5B and $+0.33$ points at 1.5B across the suite, at a slightly reduced active-parameter budget (0.4B vs. 0.5B; 1.2B vs. 1.5B). The gain is modest but indicates that the semantic chunk structure of DCDM composes cleanly with sparse conditional computation rather than interfering with it.

Relative orderings among the four models are largely preserved when scaling from 0\.5B to 1\.5B: DCDM \(MoE\) typically leads the suite, followed by dense DCDM, with WinoGrande the consistent exception in favor of BDLM\. The size of DCDM’s advantage over the diffusion baselines is benchmark\-dependent\. At the suite level, the overall gap between DCDM and the diffusion baselines is broadly preserved across scales\.

Table 2: Comparison against alternative block-partitioning strategies at the 1.5B scale on math (MATH, GSM8K) and code (HumanEval) benchmarks. Models are grouped by the type of block structure used at sampling time. Best per row in bold, second-best underlined. $\ast$ Average is computed over the three reported benchmarks.

#### Comparison against adaptive block\-partitioning methods\.

We further compare DCDM against AdaBlock-dLLM (AdaBlock), a recent method that adapts block size *at sampling time* on top of a model trained with fixed positional blocks. As shown in Table [2](https://arxiv.org/html/2605.15676#S5.T2), DCDM attains the best dense average ($28.17$), clearly above AdaBlock ($27.23$); DCDM (MoE) further improves to $28.52$ at a smaller active-parameter budget ($1.2$B vs. $1.5$B). Beyond the headline numbers, such adaptive baselines adjust block size only at sampling time, so the partition seen at inference is not the one the model was optimized against; DCDM removes this train-test mismatch by design.

![Refer to caption](https://arxiv.org/html/2605.15676v1/assets/images/params_vs_acc.png)
(a) Parameters vs. accuracy.

![Refer to caption](https://arxiv.org/html/2605.15676v1/assets/images/steps_vs_acc.png)
(b) Training steps vs. accuracy.

Figure 3: Scaling and training efficiency of the three dense diffusion models (MDLM, BDLM, DCDM) on the downstream benchmark suite. (a) Suite-average accuracy at three parameter scales (0.1B, 0.5B, 1.5B), each trained under our standard token budget. (b) Suite-average accuracy along the training trajectory at the 1.5B scale; shaded regions denote one standard deviation across runs.

#### Scaling and Training\.

Figure [3](https://arxiv.org/html/2605.15676#S5.F3) examines how the three dense diffusion models trade off accuracy against (a) parameter count and (b) the number of training steps consumed.

Figure [3](https://arxiv.org/html/2605.15676#S5.F3)(a) reports suite-average accuracy at three parameter scales. The ordering $\text{DCDM} > \text{BDLM} > \text{MDLM}$ holds at every scale we evaluate. The gap between MDLM and the two block-structured models widens with scale, indicating that block structure is not merely a small-scale regularizer but contributes more as capacity grows. The gap between DCDM and BDLM is smaller throughout but remains positive across scales, isolating the contribution of content-defined, as opposed to merely positional, blocks once the model is large enough to exploit either form of partition.

Figure [3](https://arxiv.org/html/2605.15676#S5.F3)(b) traces accuracy along the training trajectory at the 1.5B scale. The three models start from comparable accuracy, but the ordering $\text{DCDM} > \text{BDLM} > \text{MDLM}$ emerges within the early portion of training and remains stable thereafter. MDLM plateaus earliest and at the lowest level, while BDLM and DCDM continue to improve toward the end of the run. DCDM matches MDLM's final accuracy well before MDLM finishes training, and reaches BDLM's final accuracy noticeably earlier than BDLM does, indicating genuinely faster optimization rather than convergence to a similar asymptote along a different path. A plausible explanation is structural: routing tokens into content-defined blocks supplies denoising targets that are coherent at the level of the partition, so the gradient signal at each step concentrates on within-block dependencies rather than being diluted across unrelated positions.

Table 3: Per-benchmark results for the centroid-count ablation ($K \in \{8, 16, 32, 64\}$) at the 0.5B scale. Best in each column in bold.

### 5\.3 Ablation Study: Number of Clusters

A central hyperparameter of DCDM is the number of clusters $K$ in the chunking attention layer, which determines the maximum granularity of the semantic partition. We isolate its effect with an ablation at the 0.5B scale, training DCDM with $K \in \{8, 16, 32, 64\}$ under otherwise identical settings and evaluating on the general-reasoning subset of the downstream benchmark suite. Per-benchmark numbers are reported in Table [3](https://arxiv.org/html/2605.15676#S5.T3).

On the suite average, $K=16$ achieves the best result ($40.58$), ahead of the alternatives by a modest but consistent margin ($K=8$: $40.01$; $K=32$: $40.11$; $K=64$: $40.06$). The trend is non-monotonic and exhibits a clear interior optimum: average performance peaks at $K=16$ and degrades at both extremes. This inverted-U shape is consistent with the two failure modes that bracket the choice of $K$: too few clusters under-partition the sequence and weaken the selectivity of the bilinear gate of Eq. ([5](https://arxiv.org/html/2605.15676#S4.E5)), pushing the soft aggregation toward the unstructured limit of MDLM; too many clusters over-fragment the partition, leaving each cluster with too few tokens to support meaningful intra-block bidirectional denoising and forcing the load-balancing mechanisms of Section [4.3](https://arxiv.org/html/2605.15676#S4.SS3) to operate near their stability margin.

## 6 Conclusion

We introduced DCDM, a discrete diffusion language model that replaces the fixed positional blocks of BDLM with content\-defined chunks produced by a learned subspace\-clustering attention layer\. DCDM matches BDLM in generation flexibility while consistently improving training loss, downstream accuracy, and zero\-shot perplexity at every scale we tested, and it admits MoE extensions for further gains\.

## References

- \[1\] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- \[2\] (2025) Block diffusion: interpolating between autoregressive and diffusion language models. External Links: 2503.09573, [Link](https://arxiv.org/abs/2503.09573).
- \[3\] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2023) Structured denoising diffusion models in discrete state-spaces. External Links: 2107.03006, [Link](https://arxiv.org/abs/2107.03006).
- \[4\] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu (2023) Qwen technical report. arXiv preprint arXiv:2309.16609.
- \[5\] M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen (2022) Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255.
- \[6\] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) Piqa: reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 7432–7439.
- \[7\] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901.
- \[8\] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson (2013) One billion word benchmark for measuring progress in statistical language modeling. Preprint, Technical Report arXiv:1312.3005.
- \[9\] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1.
- \[10\] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- \[11\] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian (2018) A discourse-aware attention model for abstractive summarization of long documents. Preprint, Technical Report arXiv:1804.05685.
- \[12\] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- \[13\] W. Fedus, B. Zoph, and N. Shazeer (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), pp. 1–39.
- \[14\] A. Gokaslan and V. Cohen (2019) OpenWebText corpus. Note: [http://Skylion007.github.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus).
- \[15\] S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, J. Han, et al. (2024) Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891.
- \[16\] Google DeepMind (2025) Gemini diffusion. Note: Accessed: 2026-04-21. External Links: [Link](https://deepmind.google/models/gemini-diffusion/).
- \[17\] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt (2021) Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR).
- \[18\] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- \[19\] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans (2021) Autoregressive diffusion models. arXiv preprint arXiv:2110.02037.
- \[20\] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- \[21\] Y. Jiang, C. Yu, Z. Lin, and X. Liu (2025) A mini-batch training strategy for deep subspace clustering networks. arXiv preprint arXiv:2507.19917.
- \[22\] I. Labs, S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha, et al. (2025) Mercury: ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298.
- \[23\] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let's verify step by step. arXiv preprint arXiv:2305.20050.
- \[24\] S. Lin, J. Hilton, and O. Evans (2022) Truthfulqa: measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pp. 3214–3252.
- \[25\] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- \[26\] G. Lu, H. M. Chen, Y. Karashima, Z. Wang, D. Fujiki, and H. Fan (2025) Adablock-dllm: semantic-aware diffusion llm inference via adaptive block size. arXiv preprint arXiv:2509.26432.
- \[27\] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19(2), pp. 313–330. External Links: [Link](https://aclanthology.org/J93-2004/).
- \[28\] R. Maulen-Soto, P. Marion, and C. Boyer (2025) Attention-based clustering. arXiv preprint arXiv:2505.13112.
- \[29\] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In International Conference on Learning Representations.
- \[30\] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992.
- \[31\] D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández (2016) The lambada dataset: word prediction requiring a broad discourse context. In 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016-Long Papers, Vol. 3, pp. 1525–1534.
- \[32\] L. Parsons, E. Haque, and H. Liu (2004-06) Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl. 6(1), pp. 90–105. External Links: ISSN 1931-0145, [Link](https://doi.org/10.1145/1007730.1007731), [Document](https://dx.doi.org/10.1145/1007730.1007731).
- \[33\] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- \[34\] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. External Links: 2406.07524, [Link](https://arxiv.org/abs/2406.07524).
- \[35\] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) Winogrande: an adversarial winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106.
- \[36\] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- \[37\] J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias (2024) Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems 37, pp. 103131–103167.
- \[38\] A. Shih, D. Sadigh, and S. Ermon (2022) Training and inference on any-order autoregressive models the right way. Advances in Neural Information Processing Systems 35, pp. 2762–2775.
- \[39\] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- \[40\] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30.
- \[41\] L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai (2024) Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664.
- \[42\] C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2025) Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618.
- \[43\] W. Wu, W. Wang, and S. Kong (2023) Deep structure and attention aware subspace clustering. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp. 139–150.
- \[44\] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487.
- \[45\] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) Hellaswag: can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- \[46\] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification.

## Appendix A Pseudocode

Algorithm [1](https://arxiv.org/html/2605.15676#alg1) reproduces the chunking attention layer of Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1) as a single-batch computation. The notation follows the main text: $L$ is the sequence length, $d$ the model dimension, $K$ the number of clusters, and $h$ the per-cluster subspace dimension. The load-balancing bias $\mathbf{b} \in \mathbb{R}^{K}$ is the running correction defined in Section [4.3](https://arxiv.org/html/2605.15676#S4.SS3); it is updated externally once per optimizer interval and carries no gradient.

Algorithm 1: Chunking Attention (single batch element).

1. Input: embedding $\mathbf{H} \in \mathbb{R}^{L \times d}$; centroid projections $\{\bm{\mu}_k\}_{k=1}^{K}$, $\bm{\mu}_k \in \mathbb{R}^{d \times h}$; value/output projections $\mathbf{W}_V, \mathbf{W}_O \in \mathbb{R}^{d \times d}$; load-balancing bias $\mathbf{b} \in \mathbb{R}^{K}$
2. Output: updated embedding $\mathbf{Y} \in \mathbb{R}^{L \times d}$; hard cluster ids $\mathbf{c} \in \{1, \dots, K\}^{L}$; per-sequence chunk loss $\mathcal{L}_{\mathrm{chunk}}$
3. // Soft path: project, score, softmax, aggregate.
4. $\bm{p}_{k,\ell} \leftarrow \bm{\mu}_k^{\top} \bm{x}_\ell$ $\quad\triangleright$ $\bm{p}_{k,\ell} \in \mathbb{R}^{h}$ for all $\ell, k$
5. for $k = 1, \dots, K$ do
6. $[\mathbf{A}_k]_{\ell,m} \leftarrow \bm{p}_{k,\ell}^{\top} \bm{p}_{k,m} / \sqrt{h}$
7. $\mathbf{S}_k \leftarrow \mathrm{softmax}(\mathbf{A}_k)$ $\quad\triangleright$ row-wise; $\mathbf{S}_k \in \mathbb{R}^{L \times L}$
8. end for
9. $\mathbf{Y} \leftarrow \mathbf{W}_O \Big(\tfrac{1}{\sqrt{K}} \sum_k \mathbf{S}_k\Big) \mathbf{W}_V \mathbf{H}$
10. // Hard routing: read off the cluster id used by the mask.
11. $r_{\ell,k} \leftarrow \lVert \bm{p}_{k,\ell} \rVert$ $\quad\triangleright$ reuses the diagonal of $\mathbf{A}_k$
12. $c_\ell \leftarrow \arg\max_k (r_{\ell,k} + b_k)$ $\quad\triangleright$ $\mathbf{b}$ updated externally, no gradient
13. // Per-sequence load-balancing loss.
14. $\tilde{\mathbf{c}}_\ell \leftarrow \mathrm{GumbelSoftmax}_{\mathrm{ST}}(\bm{r}_\ell)$ $\quad\triangleright$ $\bm{r}_\ell = (r_{\ell,1}, \dots, r_{\ell,K})$; differentiable hard sample
15. $f_k \leftarrow \tfrac{1}{L} \sum_\ell [\tilde{\mathbf{c}}_\ell]_k$ $\quad\triangleright$ per-sequence usage frequency of centroid $k$
16. $\mathcal{L}_{\mathrm{chunk}} \leftarrow -\tfrac{1}{K} \sum_k \log(f_k + \varepsilon)$
17. return $\mathbf{Y}, \mathbf{c}, \mathcal{L}_{\mathrm{chunk}}$

The algorithm splits into three logical blocks. The *soft path* projects $\mathbf{H}$ onto every cluster subspace, forms the per-centroid affinity matrices $\mathbf{A}_k$ and their row-wise softmaxes $\mathbf{S}_k$, and aggregates them into the module output $\mathbf{Y}$; this is the only block through which the diffusion objective carries gradients to the centroid projections $\{\bm{\mu}_k\}$. The *hard routing* block reads off the cluster identity $c_\ell$ from the alignment scores $r_{\ell,k}$ after adding the load-balancing bias $\mathbf{b}$; the resulting $\mathbf{c}$ is the index from which the downstream attention mask $\mathbf{M}^{\text{chunk}}$ is built. The *load-balancing* block draws a straight-through Gumbel-softmax sample $\tilde{\mathbf{c}}_\ell$ and uses it to compute the per-sequence usage frequency $f_k$ and the auxiliary loss $\mathcal{L}_{\mathrm{chunk}}$, which back-propagates into $\{\bm{\mu}_k\}$ alongside the diffusion objective.
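The three blocks can be traced in a forward-only NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `chunking_attention_forward` is hypothetical, cluster ids are zero-indexed, the product in line 9 is laid out in a row-vector convention (`mix @ H @ W_V @ W_O`), and, since NumPy has no autograd, the straight-through Gumbel-softmax is reduced to its forward value (a one-hot sample from Gumbel-perturbed scores).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def chunking_attention_forward(H, mu, W_V, W_O, b, eps=1e-6, rng=None):
    """Forward pass only. H: (L, d); mu: (K, d, h); W_V, W_O: (d, d); b: (K,)."""
    rng = np.random.default_rng() if rng is None else rng
    K, _, h = mu.shape

    # Soft path: project, score, softmax, aggregate.
    P = np.einsum("kdh,ld->klh", mu, H)               # p_{k,l} = mu_k^T x_l, shape (K, L, h)
    A = np.einsum("klh,kmh->klm", P, P) / np.sqrt(h)  # per-centroid affinities, (K, L, L)
    S = softmax(A, axis=-1)                           # row-wise softmax of each A_k
    mix = S.sum(axis=0) / np.sqrt(K)                  # (L, L)
    Y = mix @ (H @ W_V) @ W_O                         # assumed row-vector layout of line 9

    # Hard routing: cluster id used by the mask (zero-indexed here).
    r = np.linalg.norm(P, axis=-1).T                  # alignment scores, (L, K)
    c = np.argmax(r + b, axis=-1)

    # Load balancing: forward value of a straight-through Gumbel-softmax sample.
    gumbel = rng.gumbel(size=r.shape)
    one_hot = np.eye(K)[np.argmax(r + gumbel, axis=-1)]  # (L, K)
    f = one_hot.mean(axis=0)                             # per-sequence usage frequency
    loss = -np.mean(np.log(f + eps))
    return Y, c, loss

# Toy shapes chosen only for illustration.
rng = np.random.default_rng(0)
L, d, K, h = 6, 8, 4, 3
H = rng.normal(size=(L, d))
mu = rng.normal(size=(K, d, h))
W_V = rng.normal(size=(d, d)) / np.sqrt(d)
W_O = rng.normal(size=(d, d)) / np.sqrt(d)
Y, c, loss = chunking_attention_forward(H, mu, W_V, W_O, np.zeros(K), rng=rng)
```

In a trainable version the same computation would be written in an autograd framework, with the bias `b` excluded from the computation graph.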

## Appendix B Attention Mask Construction

#### Noise mask\.

Let $\nu_\ell \in \{0, 1\}$ indicate whether position $\ell$ (taken modulo $L$ on the doubled sequence) is a masked token at the current diffusion timestep. The noise mask is

$$\mathbf{M}^{\text{noise}}_{\ell,m} \;=\; \mathbb{I}[\nu_m = 0] \;\lor\; \mathbb{I}[\nu_\ell = 1 \,\land\, \ell = m]. \tag{15}$$

A clean query attends only to clean keys; a noisy query attends to all clean keys plus its own position. The first clause prevents the information-less mask-token embedding from contaminating the representations of clean tokens, and the second clause is a self-loop that keeps each noisy position identifiable through the layer.
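As a concrete check of Eq. (15), the following NumPy sketch builds the noise mask for a short toy sequence. The helper name `noise_mask` and the example indicator vector are assumptions for illustration; `nu` is taken directly as the per-position mask indicator of the sequence being attended over.

```python
import numpy as np

def noise_mask(nu):
    """M[l, m] = [nu_m == 0] or ([nu_l == 1] and l == m), as a boolean matrix."""
    nu = np.asarray(nu, dtype=bool)
    clean_key = ~nu[None, :]                                # every query may see clean keys
    self_loop = nu[:, None] & np.eye(nu.size, dtype=bool)   # noisy queries also see themselves
    return clean_key | self_loop

# Positions 1 and 3 are masked (noisy); positions 0 and 2 are clean.
M = noise_mask([0, 1, 0, 1])
print(M.astype(int))
# [[1 0 1 0]
#  [1 1 1 0]
#  [1 0 1 0]
#  [1 0 1 1]]
```

Rows index queries and columns index keys: the two noisy rows differ from the clean rows only by their diagonal self-loop entry.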

#### Chunk mask\.

At inference time the chunk mask reduces directly to the chunk\-causal mask,

$$\mathbf{M}^{\text{chunk}}_{\ell,m} \;=\; \mathbb{I}[c_m \leq c_\ell], \tag{16}$$

where $c_\ell$ is the cluster index of position $\ell$. At training time, the doubled sequence requires three explicit cases. Let $\sigma_\ell \in \{0, 1\}$ indicate whether position $\ell$ lies in the noisy half ($\sigma_\ell = 1$ iff $\ell < L$), and let $c_\ell$ denote the cluster index of position $\ell \bmod L$. The training-time chunk mask is

$$
\begin{aligned}
\mathbf{M}^{\text{chunk}}_{\ell,m} \;=\; {} & \mathbb{I}[\sigma_\ell = 1,\, \sigma_m = 1,\, c_\ell = c_m] \\
\lor\; & \mathbb{I}[\sigma_\ell = 1,\, \sigma_m = 0,\, c_\ell > c_m] \\
\lor\; & \mathbb{I}[\sigma_\ell = 0,\, \sigma_m = 0,\, c_\ell \geq c_m].
\end{aligned}
\tag{17}
$$

The three clauses cover, respectively: bidirectional attention within a chunk in the noisy half (parallel denoising of co-clustered tokens); attention from a noisy query to a clean key in a strictly earlier chunk (cross-block conditioning, sourced from the clean half); and the standard chunk-causal mask within the clean half (autoregressive teacher-forcing). Notably, no clause connects a clean query to a noisy key, so the autoregressive teacher-forcing signal is never contaminated by noise.
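Both chunk masks can be written directly from Eqs. (16) and (17). The sketch below is a minimal NumPy illustration; the function names and the toy cluster assignment are hypothetical, cluster ids are zero-indexed, and the doubled sequence is assumed to place the noisy half at positions $0, \dots, L-1$ and the clean half at positions $L, \dots, 2L-1$, following $\sigma_\ell = 1$ iff $\ell < L$.

```python
import numpy as np

def inference_chunk_mask(c):
    """Eq. (16): query l may attend to key m iff c_m <= c_l."""
    c = np.asarray(c)
    return c[None, :] <= c[:, None]

def training_chunk_mask(c):
    """Eq. (17) on the doubled sequence of length 2L (noisy half first)."""
    c = np.asarray(c)
    L = c.size
    c2 = np.concatenate([c, c])           # cluster id of position l mod L
    sigma = np.arange(2 * L) < L          # True on the noisy half
    cq, ck = c2[:, None], c2[None, :]     # query / key cluster ids
    sq, sk = sigma[:, None], sigma[None, :]
    noisy_to_noisy = sq & sk & (cq == ck)     # bidirectional within a chunk
    noisy_to_clean = sq & ~sk & (cq > ck)     # strictly earlier clean chunks
    clean_to_clean = ~sq & ~sk & (cq >= ck)   # chunk-causal teacher forcing
    return noisy_to_noisy | noisy_to_clean | clean_to_clean

# Toy example: L = 3 tokens, tokens 0 and 2 share chunk 0, token 1 is in chunk 1.
c = [0, 1, 0]
M_inf = inference_chunk_mask(c)
M_train = training_chunk_mask(c)
print(M_train.astype(int))
```

In the toy example the lower-left $L \times L$ block of `M_train` is all zeros, which matches the remark that no clean query attends to a noisy key.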

## Appendix C DCDM NELBO

We provide the NELBO derivation for DCDM. The derivation follows BDLMs \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\] step for step; the only substantive change is that DCDM factorizes the joint distribution over the $K$ *semantic* chunks $\{\mathcal{B}_k\}_{k=1}^{K}$ produced by the chunking attention module of Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1), with the cluster identifier $c_\ell$ used directly as the chunk index, rather than over fixed-size positional blocks.

Let $\mathbf{x}_{1:L} = [\mathbf{x}_1, \ldots, \mathbf{x}_L]$ be a sequence drawn from the data distribution $q(\mathbf{x})$. Following the partition of Section [4.2](https://arxiv.org/html/2605.15676#S4.SS2), we set

$$\mathbf{x}^{(k)} \;:=\; (\mathbf{x}_\ell)_{\ell \in \mathcal{B}_k}, \qquad \mathbf{x}^{(<k)} \;:=\; \bigcup_{j<k} \mathbf{x}^{(j)}, \qquad k \in \{1, \ldots, K\},$$

with block size $L_k := |\mathcal{B}_k|$. Unlike the positional blocks of BDLM, $L_k$ is not uniform; it is determined per sequence by the learned routing. For brevity we abbreviate $\mathbf{x}^{(k)}$ to $\mathbf{x}^{k}$ throughout this appendix. Each block undergoes diffusion over $T$ discretization steps, with $t(i) = i/T$ and $s(i) = (i-1)/T$ for $i \in [1, T]$; let $\mathrm{D}_{\mathrm{KL}}[\,\cdot\,]$ denote the Kullback–Leibler divergence. Then

$$
\begin{aligned}
-\log p_{\theta}(\mathbf{x})
&= -\sum_{k=1}^{K}\log p_{\theta}\bigl(\mathbf{x}^{k}\,\big|\,\mathbf{x}^{<k}\bigr) \\
&= -\sum_{k=1}^{K}\log\mathbb{E}_{q}\frac{p_{\theta}\bigl(\mathbf{x}^{k}_{t(1):t(T)}\,\big|\,\mathbf{x}^{<k}\bigr)}{q\bigl(\mathbf{x}^{k}_{t(1):t(T)}\,\big|\,\mathbf{x}^{k}\bigr)} \\
&= -\sum_{k=1}^{K}\log\mathbb{E}_{q}\frac{p_{\theta}\bigl(\mathbf{x}^{k}_{t(T)}\,\big|\,\mathbf{x}^{<k}\bigr)\prod_{i=1}^{T}p_{\theta}\bigl(\mathbf{x}^{k}_{s(i)}\,\big|\,\mathbf{x}^{k}_{t(i)},\mathbf{x}^{<k}\bigr)}{\prod_{i=1}^{T}q\bigl(\mathbf{x}^{k}_{t(i)}\,\big|\,\mathbf{x}^{k}_{s(i)}\bigr)} \\
&\leq \sum_{k=1}^{K}\Big[\underbrace{-\mathbb{E}_{q}\log p_{\theta}\bigl(\mathbf{x}^{k}\,\big|\,\mathbf{x}^{k}_{t=1/T},\mathbf{x}^{<k}\bigr)}_{\mathcal{L}_{\text{recons}}} \\
&\quad + \underbrace{\mathbb{E}_{t\in\{2/T,\ldots,(T-1)/T,1\}}\mathbb{E}_{q}\,T\,\mathrm{D}_{\mathrm{KL}}\!\bigl(q(\mathbf{x}^{k}_{s}|\mathbf{x}^{k}_{t},\mathbf{x}^{k})\,\big\|\,p_{\theta}(\mathbf{x}^{k}_{s}|\mathbf{x}^{k}_{t},\mathbf{x}^{<k})\bigr)}_{\mathcal{L}_{\text{diffusion}}} \\
&\quad + \underbrace{\mathrm{D}_{\mathrm{KL}}\!\bigl(q(\mathbf{x}^{k}_{t=1}|\mathbf{x}^{k})\,\big\|\,p_{\theta}(\mathbf{x}^{k}_{t=1})\bigr)}_{\mathcal{L}_{\text{prior}}}\Big].
\end{aligned}
\tag{18}
$$

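
The partition $\mathbf{x}^{(k)}$, the prefix $\mathbf{x}^{(<k)}$, and the time grid $(s(i), t(i))$ used in the derivation above can be made concrete with a short sketch. The helper names below are hypothetical, and chunk indices are 0-based in code (the text uses $k\in\{1,\ldots,K\}$).

```python
def partition_by_chunk(tokens, chunk_ids, num_chunks):
    """Group tokens by chunk index, keeping the original order inside each chunk."""
    blocks = [[] for _ in range(num_chunks)]
    for tok, k in zip(tokens, chunk_ids):
        blocks[k].append(tok)
    return blocks


def prefix(blocks, k):
    """Tokens of all chunks with index strictly below k, i.e. x^(<k)."""
    return [tok for block in blocks[:k] for tok in block]


def time_grid(T):
    """Pairs (s(i), t(i)) = ((i - 1) / T, i / T) for i = 1, ..., T."""
    return [((i - 1) / T, i / T) for i in range(1, T + 1)]


# Toy routing (hypothetical): five tokens assigned to K = 3 chunks.
tokens = ["a", "b", "c", "d", "e"]
chunk_ids = [0, 1, 0, 2, 1]
blocks = partition_by_chunk(tokens, chunk_ids, num_chunks=3)
sizes = [len(b) for b in blocks]  # the per-chunk sizes L_k
```

The chunk sizes differ from chunk to chunk and always sum to the sequence length $L$, whereas a positional partition would fix every size in advance.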
We now specialize Eq. ([18](https://arxiv.org/html/2605.15676#A3.E18)) to the masked diffusion process used in our implementation. In BDLM, the denoiser conditions on positionally earlier blocks; in DCDM, it conditions on semantically earlier chunks $\mathbf{x}^{<k}$, with the conditioning realized operationally by the chunk-causal attention mask $\mathbf{M}^{\text{chunk}}$. Since the SUBS parameterization of Sahoo et al. \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\] constrains $p_{\theta}$ token-wise, enforcing zero masking probabilities and carry-over unmasking on a per-token basis, it does not depend on the block structure, and its derivation carries over verbatim once the variable-length DCDM chunks $\mathcal{B}_{k}$ of size $L_{k}$ are substituted for the fixed-size positional blocks of BDLM.
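
The two SUBS constraints named above act on each position independently, which is why they are indifferent to how positions are grouped into chunks. A minimal sketch in plain Python follows; `subs_log_probs` and the list-of-lists logit layout are hypothetical names and choices for illustration, not the paper's code.

```python
import math

NEG_INF = float("-inf")


def log_softmax(logits):
    """Log-softmax over one position, skipping entries already set to -inf."""
    finite = [v for v in logits if v != NEG_INF]
    mx = max(finite)
    lse = mx + math.log(sum(math.exp(v - mx) for v in finite))
    return [v - lse if v != NEG_INF else NEG_INF for v in logits]


def subs_log_probs(logits, x_t, mask_id):
    """Apply the two SUBS constraints position by position.

    logits: one list of vocabulary logits per position.
    x_t:    noisy token ids; mask_id marks a masked position.
    """
    out = []
    for pos_logits, tok in zip(logits, x_t):
        if tok != mask_id:
            # Carry-over unmasking: an observed token is copied unchanged.
            row = [NEG_INF] * len(pos_logits)
            row[tok] = 0.0
        else:
            # Zero masking probability: the clean token is never the mask token.
            row = list(pos_logits)
            row[mask_id] = NEG_INF
            row = log_softmax(row)
        out.append(row)
    return out


# Toy example: vocabulary of 4 ids where id 3 is the mask token.
out = subs_log_probs([[0.0, 0.0, 0.0, 5.0], [1.0, 2.0, 3.0, 4.0]], x_t=[3, 1], mask_id=3)
```

Neither branch looks at neighboring positions or at chunk indices, so swapping positional blocks for semantic chunks changes only which tokens the denoiser attends to, not this output constraint.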

Following Sahoo et al. \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\], the diffusion loss simplifies to

$$
\begin{aligned}
\mathcal{L}_{\text{diffusion}}
&= \sum_{k=1}^{K}\mathbb{E}_{t}\mathbb{E}_{q}\,T\left[\sum_{\ell=1}^{L_{k}}\mathrm{D}_{\mathrm{KL}}\!\bigl(q(\mathbf{x}^{k,\ell}_{s}|\mathbf{x}^{k,\ell}_{t},\mathbf{x}^{k,\ell})\,\big\|\,p_{\theta}(\mathbf{x}^{k,\ell}_{s}|\mathbf{x}^{k}_{t},\mathbf{x}^{<k})\bigr)\right] \\
&= \sum_{k=1}^{K}\mathbb{E}_{t}\mathbb{E}_{q}\,T\left[\frac{\alpha_{t}-\alpha_{s}}{1-\alpha_{t}}\,\log p_{\theta}\bigl(\mathbf{x}^{k}\,\big|\,\mathbf{x}^{k}_{t},\mathbf{x}^{<k}\bigr)\right].
\end{aligned}
\tag{19}
$$

Taking $T\to\infty$ with $T(\alpha_{t}-\alpha_{s})=\alpha^{\prime}_{t}$ yields the continuous-time form

$$
\mathcal{L}_{\text{diffusion}} \;=\; \sum_{k=1}^{K}\mathbb{E}_{t\sim[0,1]}\mathbb{E}_{q}\left[\frac{\alpha^{\prime}_{t}}{1-\alpha_{t}}\,\log p_{\theta}\bigl(\mathbf{x}^{k}\,\big|\,\mathbf{x}^{k}_{t},\mathbf{x}^{<k}\bigr)\right].
\tag{20}
$$

By the same arguments as in Sahoo et al. \[[34](https://arxiv.org/html/2605.15676#bib.bib3), Suppl. A.2.4\], $\mathcal{L}_{\text{recons}}=0$ in the continuous-time limit because $\mathbf{x}^{k}_{t(1)}\sim\lim_{T\to\infty}\mathrm{Cat}(\,\cdot\,;\,\mathbf{x}^{k}_{t=1/T})=\mathrm{Cat}(\,\cdot\,;\,\mathbf{x}^{k})$, and $\mathcal{L}_{\text{prior}}=0$ because $\alpha_{t=1}=0$ ensures $q(\mathbf{x}^{k}_{t=1}|\mathbf{x}^{k})=\mathrm{Cat}(\,\cdot\,;\,\mathbf{m})=p_{\theta}(\mathbf{x}^{k}_{t=1})$. Consequently, the final DCDM diffusion objective is

$$
\mathcal{L}_{\text{DCDM}}(\mathbf{x};\theta) \;=\; \sum_{k=1}^{K}\mathbb{E}_{t\sim[0,1]}\mathbb{E}_{q}\left[\frac{\alpha^{\prime}_{t}}{1-\alpha_{t}}\,\log p_{\theta}\bigl(\mathbf{x}^{k}\,\big|\,\mathbf{x}^{k}_{t},\mathbf{x}^{<k}\bigr)\right],
\tag{21}
$$

which is invariant to the choice of noise schedule $\alpha_{t}$ \[[34](https://arxiv.org/html/2605.15676#bib.bib3), Suppl. E.1.1\].
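
To make the weighting in Eqs. (19) to (21) concrete, the sketch below evaluates the discrete weight $T(\alpha_{t}-\alpha_{s})/(1-\alpha_{t})$ and its continuous-time counterpart $\alpha^{\prime}_{t}/(1-\alpha_{t})$, then forms a single-sample estimate of Eq. (21). The linear schedule $\alpha_{t}=1-t$, the per-chunk sampling of $t$, and all function names are assumptions made for illustration; the appendix does not fix them.

```python
def continuous_weight(alpha, dalpha, t):
    """alpha'_t / (1 - alpha_t), the coefficient in Eqs. (20) and (21)."""
    return dalpha(t) / (1.0 - alpha(t))


def discrete_weight(alpha, t, T):
    """T * (alpha_t - alpha_s) / (1 - alpha_t) with s = t - 1/T, as in Eq. (19)."""
    s = t - 1.0 / T
    return T * (alpha(t) - alpha(s)) / (1.0 - alpha(t))


def dcdm_loss_sample(chunk_log_probs, chunk_times, alpha, dalpha):
    """Single-sample estimate of Eq. (21).

    chunk_log_probs[k] stands for log p_theta(x^k | x^k_t, x^{<k}) and
    chunk_times[k] for the time t drawn for chunk k (both hypothetical inputs).
    """
    return sum(
        continuous_weight(alpha, dalpha, t) * lp
        for lp, t in zip(chunk_log_probs, chunk_times)
    )


# Assumed linear schedule: alpha_t = 1 - t, so alpha'_t = -1 and the weight is -1/t.
alpha = lambda t: 1.0 - t
dalpha = lambda t: -1.0

loss = dcdm_loss_sample([-1.0, -3.0], [0.5, 0.25], alpha, dalpha)
```

Both the weight and the log-likelihood are non-positive, so the estimate is non-negative. Under this assumed schedule the weight grows without bound as $t\to 0$, which is one reason a lower bound on the sampled time is useful in practice.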

## Appendix D Implementation Details

### D.1 Training Setup

All models are trained from scratch with the AdamW optimizer ($\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, weight decay $0.01$) using a cosine learning-rate schedule with 2,500 linear-warmup steps and a peak learning rate of $3\times 10^{-4}$. The global training batch size and compute footprint scale with model size: 0.1B models use a global batch size of 128 on 8× H800 GPUs, while models of 0.5B and above use a global batch size of 512 on 64× H800 GPUs. Gradients are clipped to a maximum norm of 1.0. We train for 500,000 optimization steps in bfloat16 mixed precision with TF32 matmul allowed. Diffusion training uses a time-sampling lower bound of $\epsilon_{t}=10^{-3}$. For DCDM variants the chunking auxiliary loss is weighted by $10^{-2}$, and for the MoE variants the router auxiliary loss is also weighted by $10^{-2}$. A summary of the optimizer/training hyperparameters is given in Table [4](https://arxiv.org/html/2605.15676#A4.T4).
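
The learning-rate schedule described above can be sketched as a function of the optimization step. The warmup length, peak value, and total step count come from the text; the final learning rate of zero and the exact warmup formula are assumptions, since the text does not state them.

```python
import math

PEAK_LR = 3e-4          # peak learning rate (from the text)
WARMUP_STEPS = 2_500    # linear-warmup steps (from the text)
TOTAL_STEPS = 500_000   # optimization steps (from the text)
MIN_LR = 0.0            # assumption: the cosine decays to zero


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(1250)` is half the peak value, and `learning_rate(2500)` is the peak itself.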

Table 4: Training hyperparameters shared across all models.

### D.2 Model Architecture

All models share a Qwen-style decoder backbone with RMSNorm ($\varepsilon=10^{-6}$), SiLU-gated MLPs, RoPE positional embeddings ($\theta=10^{6}$, max position 40,960), grouped-query attention without attention bias or dropout, and tied input/output embeddings (untied for the 16B MoE variant). The vocabulary size is 50,260 for the WebText experiments and 151,936 for the largest MoE configuration. Initializer standard deviation is 0.02. Dense MDLM uses FlashAttention-2; BDLM, DCDM and DCDM-MoE use FlexAttention to support block-/chunk-structured masking. BDLM applies block-wise diffusion with block size $B=8$. DCDM partitions the sequence into $K$ semantic chunks via the chunking attention layer of Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1), which uses $K$ learnable centroid projections of subspace dimension $h$ together with the load-balancing bias of Section [4.3](https://arxiv.org/html/2605.15676#S4.SS3). DCDM-MoE replaces every decoder MLP with a Qwen-style sparse mixture of experts (`decoder_sparse_step=1`) with top-$k$ routing (un-normalized) plus a shared expert. The architectural hyperparameters of all configurations used in this work are listed in Table [5](https://arxiv.org/html/2605.15676#A4.T5).

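Table 5: Model architectures. $N_{L}$: number of decoder layers; $d$: hidden size; $d_{\text{ff}}$: dense FFN intermediate size; $H/H_{kv}$: query/KV attention heads; $d_{h}$: per-head dimension; $B$: BDLM block size; $K$: number of DCDM chunks (Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1)); $h$: subspace dimension of the chunking clusters $\boldsymbol{\mu}_{k}\in\mathbb{R}^{d\times h}$; $E/k$: number of experts / experts per token; $d_{\text{moe}}$: per-expert intermediate size; $d_{\text{shared}}$: shared-expert intermediate size.

| Model | $N_{L}$ | $d$ | $d_{\text{ff}}$ | $H$ | $H_{kv}$ | $d_{h}$ | $B$ | $K$ | $h$ | $E/k$ | $d_{\text{moe}}$ | $d_{\text{shared}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Vanilla Diffusion MDLM* | | | | | | | | | | | | |
| MDLM-0.1B | 12 | 768 | 3072 | 12 | 12 | 128 | – | – | – | – | – | – |
| MDLM-0.5B | 28 | 1024 | 3072 | 16 | 8 | 128 | – | – | – | – | – | – |
| MDLM-1.7B | 28 | 2048 | 6144 | 16 | 8 | 128 | – | – | – | – | – | – |
| *Block-diffusion BDLM* | | | | | | | | | | | | |
| BDLM-0.1B | 12 | 768 | 3072 | 12 | 12 | 128 | 8 | – | – | – | – | – |
| BDLM-0.5B | 28 | 1024 | 3072 | 16 | 8 | 128 | 8 | – | – | – | – | – |
| BDLM-1.7B | 28 | 2048 | 6144 | 16 | 8 | 128 | 8 | – | – | – | – | – |
| *Chunked-diffusion DCDM* | | | | | | | | | | | | |
| DCDM-0.1B (K=8) | 12 | 768 | 3072 | 12 | 12 | 128 | – | 8 | 64 | – | – | – |
| DCDM-0.5B (K=8) | 28 | 1024 | 3072 | 16 | 8 | 128 | – | 8 | 256 | – | – | – |
| DCDM-0.5B (K=16) | 28 | 1024 | 3072 | 16 | 8 | 128 | – | 16 | 128 | – | – | – |
| DCDM-0.5B (K=32) | 28 | 1024 | 3072 | 16 | 8 | 128 | – | 32 | 128 | – | – | – |
| DCDM-0.5B (K=64) | 28 | 1024 | 3072 | 16 | 8 | 128 | – | 64 | 96 | – | – | – |
| DCDM-1.7B (K=8) | 28 | 2048 | 6144 | 16 | 8 | 128 | – | 8 | 128 | – | – | – |
| DCDM-1.7B (K=16) | 28 | 2048 | 6144 | 16 | 8 | 128 | – | 16 | 128 | – | – | – |
| *Sparse DCDM-MoE* | | | | | | | | | | | | |
| DCDM-MoE-0.8B/A0.4B | 28 | 896 | 2688 | 16 | 8 | 128 | – | 8 | 128 | 8/2 | 896 | 896 |
| DCDM-MoE-2.8B/A1.2B | 28 | 1792 | 5376 | 16 | 8 | 128 | – | 12 | 192 | 8/2 | 1792 | 1792 |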
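
As a reading aid for Table 5, the sketch below turns one row into weight-matrix sizes. It assumes bias-free Q/K/V/O projections under grouped-query attention and a three-matrix SiLU-gated MLP, and it ignores normalization weights and embeddings. The helper names are hypothetical, and whether the centroid tensor is instantiated once or per layer is not stated in this appendix.

```python
def attention_params(d, H, H_kv, d_h):
    """Weights in the Q, K, V, and output projections (no biases)."""
    q = d * H * d_h
    kv = 2 * d * H_kv * d_h
    o = H * d_h * d
    return q + kv + o


def gated_mlp_params(d, d_ff):
    """Gate, up, and down projections of a SiLU-gated MLP."""
    return 3 * d * d_ff


def centroid_params(d, K, h):
    """Entries in K centroid matrices, each of shape d x h."""
    return K * d * h


# DCDM-0.5B (K=16) row of Table 5.
cfg = dict(d=1024, d_ff=3072, H=16, H_kv=8, d_h=128, K=16, h=128)
attn = attention_params(cfg["d"], cfg["H"], cfg["H_kv"], cfg["d_h"])
mlp = gated_mlp_params(cfg["d"], cfg["d_ff"])
centroids = centroid_params(cfg["d"], cfg["K"], cfg["h"])
```

Under these assumptions, one set of centroid matrices for this row holds 2,097,152 weights, which is about one third of the 6,291,456 attention projection weights in a single layer.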

## Appendix E Additional Results

### E.1 Point vs. Subspace Clusters

In Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1) we argued that point-based clustering inside an attention layer is brittle in the high-dimensional embedding spaces of modern language models, and motivated the subspace parameterization of chunking attention as a remedy. We verify this claim empirically with an ablation that isolates the dimensionality of the cluster representation within the chunking attention layer itself, leaving every other component of the model untouched.

#### Setup\.

We train two variants of DCDM at the 0.1B scale that are identical in every aspect except the per-cluster subspace dimension $h$. The *point* variant uses $h=1$: each centroid $\boldsymbol{\mu}_{k}$ collapses to a single direction in $\mathbb{R}^{d}$, and the associated “subspace” degenerates to the line it spans. This is the minimal configuration of our parameterization and serves as a faithful in-architecture analogue of point-based attention clustering. The *subspace* variant uses $h=48$, the value adopted throughout the rest of the paper. The training data, tokenizer, optimizer, learning-rate schedule, batch size, number of clusters $K$, load-balancing setup, and all other hyperparameters are held fixed across the two runs; the only difference is the trailing dimension of $\boldsymbol{\mu}_{k}$.

#### Results\.

Figure [4](https://arxiv.org/html/2605.15676#A5.F4) reports two diagnostics for both variants: the diffusion training loss (left) and a cluster-violation metric tracking how far the centroid usage distribution deviates from uniform routing (right; values closer to zero indicate that the $K$ clusters receive comparable shares of tokens, while larger values indicate concentration of mass on a smaller subset).

The diffusion-loss panel matches the qualitative argument of Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1): the subspace variant reaches a substantially lower final loss (2.304 vs. 2.544 at 200k steps), a gap that opens within the first 25k steps and persists throughout training, and its trajectory is visibly less noisy.

The cluster-violation panel sharpens the picture from a quantitative gap into a qualitative failure mode. The subspace variant drives violation to near zero within the first ~10k steps and remains essentially flat for the rest of training (final value 0.015): all $K$ clusters continue to receive a roughly uniform share of tokens, exactly as the auxiliary load-balancing loss intends. The point variant, in contrast, never manages to push violation below ~2 during the entire run, and its violation begins to rise again past ~125k steps, reaching 3.33 by the end of training. The diffusion loss is near-stationary in this late regime while the routing distribution is actively degrading: this is the qualitative signature of the centroid-collapse failure mode anticipated in Section [4.1](https://arxiv.org/html/2605.15676#S4.SS1), in which a few clusters absorb most of the routed mass, the remaining clusters are starved of gradient, and the auxiliary load-balancing loss is no longer sufficient to recover them.

Together, the two panels show that the difference between point-based and subspace clustering is not a matter of degree (a more capable variant settling at a slightly better minimum) but a qualitative one: the subspace parameterization keeps routing well balanced throughout training, whereas the point parameterization actively degenerates even with the same load-balancing mechanisms in place.
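
This appendix does not spell out how cluster violation is computed. One definition used in the mixture-of-experts load-balancing literature is the maximal violation, the relative excess of the most loaded cluster over a perfectly uniform load. The sketch below implements that definition as an assumption; it may differ from the metric plotted in Figure 4, and the function name is hypothetical.

```python
from collections import Counter


def max_violation(assignments, num_clusters):
    """(max load - mean load) / mean load over hard cluster assignments.

    Returns 0.0 when every cluster receives the same number of tokens and
    grows as tokens concentrate on fewer clusters.
    """
    counts = Counter(assignments)
    mean_load = len(assignments) / num_clusters
    max_load = max(counts.get(k, 0) for k in range(num_clusters))
    return (max_load - mean_load) / mean_load


balanced = max_violation([0, 1, 2, 3, 0, 1, 2, 3], num_clusters=4)
collapsed = max_violation([0, 0, 0, 0, 0, 0, 1, 1], num_clusters=4)
```

Under this definition a value above 3 with a single-digit number of clusters would mean that one cluster receives more than four times its uniform share, which matches the qualitative description of centroid collapse.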

![Refer to caption](https://arxiv.org/html/2605.15676v1/assets/images/point_vs_subspace.png)

Figure 4: Point ($h=1$) vs. subspace ($h=48$) chunking attention at the 0.1B scale; all other training settings are identical. Left: Diffusion training loss. The subspace variant reaches a lower final loss (2.304 vs. 2.544) and follows a smoother trajectory throughout training. Right: Cluster violation, a measure of deviation from uniform centroid usage (lower values indicate more uniform routing). The subspace variant drives violation to near zero within the first ~10k steps and stays there; the point variant remains noisy throughout and rises sharply in the final 50k steps to 3.330, the qualitative signature of progressive centroid collapse.

### E.2 Zero-Shot Perplexity Evaluation

As a complementary check on language-modeling quality, we evaluate the diffusion models against a standard left-to-right autoregressive transformer (AR) trained with the next-token objective. Following the convention of prior works \[[34](https://arxiv.org/html/2605.15676#bib.bib3), [2](https://arxiv.org/html/2605.15676#bib.bib4)\], all four models are trained on OpenWebText \[[14](https://arxiv.org/html/2605.15676#bib.bib34)\] for 64B tokens at the 0.1B parameter scale, and then evaluated, without any further finetuning, on seven held-out corpora spanning encyclopedic, news, scientific, and benchmark text: PTB \[[27](https://arxiv.org/html/2605.15676#bib.bib46)\], WikiText \[[29](https://arxiv.org/html/2605.15676#bib.bib45)\], LM1B \[[8](https://arxiv.org/html/2605.15676#bib.bib47)\], Lambada \[[31](https://arxiv.org/html/2605.15676#bib.bib48)\], AG News \[[46](https://arxiv.org/html/2605.15676#bib.bib49)\], PubMed, and ArXiv \[[11](https://arxiv.org/html/2605.15676#bib.bib50)\].

#### Metric\.

We report token-level perplexity (lower is better). To ensure comparability across models with different output parameterizations, all perplexities are computed under a common tokenization and evaluation context length, with the diffusion models evaluated using the NELBO-based perplexity surrogate of Sahoo et al. \[[34](https://arxiv.org/html/2605.15676#bib.bib3)\].
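
Token-level perplexity is the exponential of the average per-token negative log-likelihood. For the diffusion models, the NELBO (an upper bound on the negative log-likelihood) takes the place of the exact value, so the reported number is an upper bound on the true perplexity. A minimal sketch with a hypothetical function name:

```python
import math


def perplexity(total_nll_nats, num_tokens):
    """exp of the mean per-token negative log-likelihood, measured in nats.

    Passing a NELBO instead of an exact NLL yields an upper bound on perplexity.
    """
    return math.exp(total_nll_nats / num_tokens)


# Sanity check: a uniform predictor over 50 symbols has perplexity 50.
uniform_ppl = perplexity(total_nll_nats=10 * math.log(50), num_tokens=10)
```

Because the surrogate is a bound rather than the likelihood itself, differences between an AR model and a diffusion model partly reflect the looseness of the bound, whereas comparisons among diffusion models use the same surrogate.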

#### Results\.

Table [6](https://arxiv.org/html/2605.15676#A5.T6) reports the resulting zero-shot perplexities. The autoregressive baseline retains an overall advantage on perplexity, achieving the best score on PTB, WikiText, LM1B, and AG News. This is consistent with the well-documented gap between AR likelihoods and the NELBO-based perplexity surrogate used by all diffusion models in the table \[[34](https://arxiv.org/html/2605.15676#bib.bib3), [2](https://arxiv.org/html/2605.15676#bib.bib4)\], so we focus the comparison on the three diffusion models, where the metric is computed under a common surrogate.

Among the diffusion models, DCDM improves over the unstructured baseline MDLM on six of the seven corpora, with the largest reductions on PTB ($-3.14$) and PubMed ($-2.96$). The only corpus on which DCDM trails MDLM is LM1B, where the gap is 1.16 points. Against the positional-block baseline BDLM, DCDM is better on five of the seven corpora, with WikiText and LM1B as the two exceptions. The gap is 0.63 points on WikiText but a more substantial 5.22 points on LM1B. On Lambada, PubMed, and ArXiv, DCDM additionally achieves the lowest perplexity among all four models, including AR. On balance, replacing positional blocks with content-defined semantic blocks yields consistent improvements in zero-shot perplexity within the block-diffusion family, while preserving the broader gains of block structure over unstructured masked diffusion.

Table 6: Zero-shot validation perplexities ($\downarrow$) on OpenWebText. MDLM and BDLM are trained for 256B tokens, while DCDM is trained for 128B tokens. \* indicates data borrowed from BDLM \[[2](https://arxiv.org/html/2605.15676#bib.bib4)\].

## Appendix F Limitation

DCDM treats the number of semantic clusters $K$ as a fixed architectural hyperparameter, shared across all sequences and all training stages. This simplification keeps the chunking attention layer efficient and admits stable optimization, but it also implies that the model uses the same number of semantic blocks regardless of the length or complexity of the input: sequences with little structural variety may end up over-partitioned, while topically rich or unusually long sequences may be under-partitioned. To manage this tension we follow the standard practice of related work in mixtures-of-experts and block diffusion language models, fixing $K$ at training time and validating its choice via ablation. Our experiments in Table [3](https://arxiv.org/html/2605.15676#S5.T3) confirm that DCDM is robust to the precise value of $K$ within a reasonable range, with all configurations tested improving over the positional-block baseline at every scale we evaluated. Lifting this restriction by learning the number of clusters per sequence, or by maintaining a distribution over $K$ that can be marginalized at inference time, is a natural extension and an interesting direction for future work.

## Appendix G Impact

#### Ethical impacts\.

This work does not raise any direct ethical concerns\. All experiments are conducted on publicly available datasets, and the study involves neither private data nor subjective human assessments at any stage\.

#### Expected societal implications\.

The principal societal risk associated with DCDM is shared with any capable generative language model: potential misuse for producing misleading content, spam, or material that violates privacy or intellectual-property norms. Mitigations developed for the broader class of large language models (content provenance and watermarking, deployment-time output filtering, and clear usage policies) apply equally here, and we encourage practitioners building on this work to adopt them.
