Multi-Block Diffusion Language Models
Summary
This paper proposes Multi-Block Diffusion Language Models (MBD-LMs), which extend single-block diffusion to concurrent multi-block decoding. It combines a post-training strategy, Multi-block Teacher Forcing (MultiTF), with an optimized decoding algorithm built on a Block Buffer mechanism. On the reported benchmarks, MBD-LLaDA2-Mini raises average tokens per forward pass (TPF) from 3.47 to 6.19 and average accuracy from 79.95% to 81.03%.
Source: https://huggingface.co/papers/2606.29215
Abstract
Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms.
Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
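To make the TPF metric and the idea of a bounded running-set concrete, here is a minimal toy simulation. It is not the paper's decoding algorithm. The function name `simulate_tpf`, the fixed per-slot unmasking rates in `slot_rates`, and the rule that only the oldest finished block leaves the running-set are all illustrative assumptions. A real decoder would decide how many tokens to commit per pass from model outputs rather than from fixed rates.

```python
from collections import deque


def simulate_tpf(num_blocks, block_size, slot_rates):
    """Toy estimate of tokens per forward pass (TPF).

    Assumptions (hypothetical, not from the paper):
    - the running-set holds at most len(slot_rates) consecutive blocks;
    - on every forward pass, the block in slot i unmasks slot_rates[i] tokens;
    - only fully decoded blocks at the front of the running-set are committed
      to the clean prefix, after which the next block enters at the back.
    """
    if not slot_rates or slot_rates[0] <= 0:
        raise ValueError("slot 0 must unmask at least one token per pass")

    capacity = len(slot_rates)
    remaining = deque()   # masked tokens left per running-set block, oldest first
    next_block = 0        # index of the next block to admit
    committed = 0         # blocks moved into the clean prefix
    forward_passes = 0

    while committed < num_blocks:
        # Refill the running-set up to its bounded capacity.
        while len(remaining) < capacity and next_block < num_blocks:
            remaining.append(block_size)
            next_block += 1

        # One forward pass over every block currently in the running-set.
        forward_passes += 1
        for slot in range(len(remaining)):
            remaining[slot] = max(0, remaining[slot] - slot_rates[slot])

        # Commit finished blocks from the front, preserving left-to-right order.
        while remaining and remaining[0] == 0:
            remaining.popleft()
            committed += 1

    return num_blocks * block_size / forward_passes


# Single-block decoding: one slot unmasking 4 tokens per pass.
single = simulate_tpf(num_blocks=4, block_size=8, slot_rates=[4])
# Two-slot running-set: the second slot is assumed to progress more slowly.
multi = simulate_tpf(num_blocks=4, block_size=8, slot_rates=[4, 2])
print(single, multi)
```

In this toy setting the two-slot running-set finishes the same 32 tokens in fewer forward passes than the single-block case, which is the sense in which inter-block parallelism raises TPF. The size of the gain depends entirely on the assumed rates.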
Models citing this paper: 8
#### SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-16B 16B • Updated about 2 hours ago • 19
#### SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-DMax-16B 16B • Updated about 2 hours ago • 21
#### SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-DMax-16B 16B • Updated about 2 hours ago • 10
#### SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-16B 16B • Updated about 2 hours ago • 110
Datasets citing this paper: 1
#### SJTU-DENG-Lab/MBD-LMs-MultiTF-Datasets Preview • Updated about 2 hours ago • 42
Similar Articles
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
This paper introduces BitLM, a language model that uses bitwise continuous diffusion to generate multiple tokens in parallel, aiming to overcome the sequential bottleneck of traditional autoregressive generation while preserving causal structure.
Dynamic Chunking for Diffusion Language Models
This paper introduces Dynamic Chunking for Diffusion Language Models (DCDM), which replaces fixed positional blocks in block discrete diffusion with content-defined semantic chunks using a differentiable Chunking Attention mechanism, achieving consistent improvements across scales up to 1.5B parameters.
Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
This paper proposes Dynamic-dLLM, a training-free framework that accelerates diffusion large language models by dynamically allocating cache-update budgets and calibrating decoding thresholds, achieving over 3x speedup on models like LLaDA and Dream while maintaining performance.
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation
This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.