Multi-Block Diffusion Language Models

Hugging Face Daily Papers 06/30/26, 12:00 AM Papers

diffusion language-models multi-block text-generation decoding training inference

Summary

This paper proposes Multi-Block Diffusion Language Models (MBD-LMs), which extend single-block diffusion to concurrent multi-block decoding. The models are obtained by post-training with Multi-block Teacher Forcing (MultiTF) and are decoded with an optimized algorithm built on a Block Buffer mechanism. In experiments, MBD-LLaDA2-Mini raises average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and average accuracy from 79.95% to 81.03%.

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
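To make the running-set idea concrete, here is a minimal toy simulation, not the paper's algorithm. It assumes a fixed number of buffer slots, a hypothetical fixed number of tokens revealed per slot per forward pass (`unmask_per_slot`), and dummy token ids; none of these names or rules come from the source. It only illustrates how a buffer of constant shape can commit finished leading blocks to the prefix while later blocks are already partially decoded, and how TPF is computed as decoded tokens divided by forward passes.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def simulate_decoding(total_blocks, num_slots, block_size, unmask_per_slot):
    """Toy running-set decoder; returns (committed_tokens, forward_passes).

    unmask_per_slot[i] is a hypothetical, fixed number of tokens that slot i
    reveals per forward pass (slot 0 holds the oldest block). A real model
    would decide this from its predictions.
    """
    # The buffer always holds num_slots blocks, so its shape stays static.
    slots = [[MASK] * block_size for _ in range(num_slots)]
    prefix = []      # stands in for the clean prefix whose cache is reused
    passes = 0
    next_token = 0   # dummy token ids, only so there is something to emit
    target = total_blocks * block_size

    while len(prefix) < target:
        passes += 1  # one forward pass over the whole buffer
        for slot, budget in zip(slots, unmask_per_slot):
            for pos in range(block_size):
                if budget == 0:
                    break
                if slot[pos] is MASK:
                    slot[pos] = next_token
                    next_token += 1
                    budget -= 1
        # Commit every fully decoded leading block and refill with a fresh one.
        while all(tok is not MASK for tok in slots[0]):
            prefix.extend(slots.pop(0))
            slots.append([MASK] * block_size)

    return prefix[:target], passes


def tokens_per_forward(num_tokens, passes):
    """TPF: decoded tokens divided by the number of forward passes."""
    return num_tokens / passes


if __name__ == "__main__":
    single_tokens, single_passes = simulate_decoding(4, 1, 4, [2])
    multi_tokens, multi_passes = simulate_decoding(4, 2, 4, [2, 1])
    print("single-block TPF:", tokens_per_forward(len(single_tokens), single_passes))
    print("multi-block TPF:", tokens_per_forward(len(multi_tokens), multi_passes))
```

In this toy setting the second slot starts revealing tokens before the first block finishes, so the same 16 tokens need fewer forward passes. For scale, the reported TPF change from 3.47 to 6.19 corresponds to a ratio of roughly 1.78; the abstract attributes turning that parallelism into wall-clock acceleration to the Block Buffer decoding algorithm.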

Original Article


Cached at: 07/01/26, 07:41 AM


Source: https://huggingface.co/papers/2606.29215 Authors:

Abstract

Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms.

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
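The abstract describes MultiTF training states as a clean prefix followed by a bounded group of noisy blocks with randomized noise levels. The sketch below is one hypothetical way to construct such a state; it is not taken from the paper. The function name, the `MASK_ID` value, the uniform choice of prefix length and group size, and the independent uniform mask ratio per block are all assumptions made for illustration, since the abstract does not specify the actual noise-schedulers.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the mask token


def sample_multitf_state(tokens, block_size, max_group_blocks, rng):
    """Build one illustrative training input: clean prefix + bounded noisy group.

    Returns (inputs, is_masked, prefix_blocks, ratios). All sampling rules
    here are assumptions, not the paper's schedulers.
    """
    num_blocks = len(tokens) // block_size
    # Bounded noise-group: at most max_group_blocks consecutive blocks.
    group_blocks = rng.randint(1, min(max_group_blocks, num_blocks))
    # Everything before the group is kept clean, as in teacher forcing.
    prefix_blocks = rng.randint(0, num_blocks - group_blocks)

    inputs = list(tokens[: prefix_blocks * block_size])
    is_masked = [False] * len(inputs)
    ratios = []

    for b in range(prefix_blocks, prefix_blocks + group_blocks):
        # Each block gets its own noise level, giving heterogeneous
        # per-slot noise patterns within the group.
        ratio = rng.random()
        ratios.append(ratio)
        for tok in tokens[b * block_size:(b + 1) * block_size]:
            masked = rng.random() < ratio
            inputs.append(MASK_ID if masked else tok)
            is_masked.append(masked)

    return inputs, is_masked, prefix_blocks, ratios


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 132))  # 8 blocks of 4 dummy token ids
    inputs, is_masked, prefix_blocks, ratios = sample_multitf_state(tokens, 4, 3, rng)
    print("clean prefix blocks:", prefix_blocks)
    print("noisy group size:", len(ratios))
    print("inputs:", inputs)
```

A training loss would then be computed only at the masked positions of the noisy group; tokens after the group are simply left out, which keeps the number of simultaneously noisy blocks bounded as in the inference-time running-set.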


Get this paper in your agent:

hf papers read 2606.29215

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 8

#### SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-16B
16B • Updated about 2 hours ago • 19

#### SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-DMax-16B
16B • Updated about 2 hours ago • 21

#### SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-DMax-16B
16B • Updated about 2 hours ago • 10

#### SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-16B
16B • Updated about 2 hours ago • 110

Datasets citing this paper: 1

#### SJTU-DENG-Lab/MBD-LMs-MultiTF-Datasets
Preview • Updated about 2 hours ago • 42

Spaces citing this paper: 0


Collections including this paper: 1

Similar Articles

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Dynamic Chunking for Diffusion Language Models

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
