Multi-Block Diffusion Language Models

Hugging Face Daily Papers

Summary

This paper proposes Multi-Block Diffusion Language Models (MBD-LMs), which extend single-block diffusion to concurrent multi-block decoding. The models are obtained by post-training existing block diffusion models with Multi-block Teacher Forcing (MultiTF), and they are decoded with an optimized algorithm built on a Block Buffer mechanism. For MBD-LLaDA2-Mini, the authors report that average Tokens Per Forward pass (TPF) rises from 3.47 to 6.19 while average accuracy improves from 79.95% to 81.03%.
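To put the reported TPF figures for MBD-LLaDA2-Mini (3.47 to 6.19) in concrete terms, the short sketch below converts them into forward-pass counts. It relies only on the definition of TPF as tokens emitted per forward pass; the 1024-token response length is a hypothetical value chosen for illustration, and the helper name `forward_passes` is not from the paper. Fewer forward passes do not by themselves guarantee a proportional wall-clock gain, which is why the paper also describes a dedicated decoding algorithm.

```python
import math


def forward_passes(num_tokens: int, tpf: float) -> int:
    """Forward passes needed to emit num_tokens at an average of tpf tokens per pass."""
    return math.ceil(num_tokens / tpf)


baseline_tpf = 3.47  # reported average TPF before MultiTF post-training
mbd_tpf = 6.19       # reported average TPF for MBD-LLaDA2-Mini

# Ratio of tokens produced per forward pass, not a wall-clock speedup.
ratio = mbd_tpf / baseline_tpf
print(f"TPF ratio: {ratio:.2f}x")

# Hypothetical response length, used only to make the numbers tangible.
num_tokens = 1024
print("baseline passes:", forward_passes(num_tokens, baseline_tpf))
print("MBD passes:     ", forward_passes(num_tokens, mbd_tpf))
```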

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
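The difference between SingleBD and MultiBD decoding can be illustrated with a toy simulation. The sketch below is not the paper's Block Buffer algorithm: it only models a bounded running-set of consecutive blocks, retires finished blocks from the front so the clean prefix stays contiguous, and counts forward passes. The per-slot commit rates in `commits_per_slot` are made-up constants (a real model would decide how many tokens to finalize from its own predictions), and the function name `decode` is hypothetical. The toy also lets the buffer shrink at the end of the sequence, so it does not model static input shapes or KV caching.

```python
from collections import deque


def decode(num_blocks, block_size, running_set, commits_per_slot):
    """Toy running-set decoder. Returns (forward_passes, tokens_per_forward).

    commits_per_slot[i] is an assumed number of tokens that slot i finalizes
    in each forward pass; it stands in for the model's real unmasking rule.
    """
    assert len(commits_per_slot) == running_set
    assert commits_per_slot[0] > 0, "the leading slot must make progress"

    next_block = 0     # index of the next block not yet admitted to the buffer
    prefix_blocks = 0  # finished blocks, i.e. what a prefix cache would cover
    buffer = deque()   # remaining masked-token count for each block in flight
    passes = 0

    def refill():
        nonlocal next_block
        while len(buffer) < running_set and next_block < num_blocks:
            buffer.append(block_size)  # a new block starts fully masked
            next_block += 1

    refill()
    while buffer:
        passes += 1
        # One forward pass updates every slot of the running set.
        for slot in range(len(buffer)):
            buffer[slot] = max(0, buffer[slot] - commits_per_slot[slot])
        # Only leading blocks retire, so the clean prefix stays contiguous.
        while buffer and buffer[0] == 0:
            buffer.popleft()
            prefix_blocks += 1
        refill()

    assert prefix_blocks == num_blocks
    return passes, num_blocks * block_size / passes


# SingleBD: one block in flight. MultiBD: three blocks in flight, with later
# slots assumed to commit fewer tokens per pass than the leading slot.
single = decode(num_blocks=8, block_size=32, running_set=1, commits_per_slot=[4])
multi = decode(num_blocks=8, block_size=32, running_set=3, commits_per_slot=[4, 2, 1])
print("SingleBD (passes, TPF):", single)
print("MultiBD  (passes, TPF):", multi)
```

Because the commit rates are invented, the TPF values this toy prints are not comparable to the paper's measurements. The point is only that blocks entering the buffer early arrive at the leading slot partially decoded, which reduces the total number of forward passes.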

Cached at: 07/01/26, 07:41 AM


Source: https://huggingface.co/papers/2606.29215

Abstract

Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms.

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
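The abstract describes MultiTF as training on a bounded noise-group conditioned on a clean prefix, with randomized noise schedules. A minimal sketch of what building one such training example might look like is given below. It is an illustration under stated assumptions, not the authors' procedure: the function name `sample_multitf_example`, the `MASK` id, the uniform sampling of per-block mask ratios, the choice to sort those ratios so earlier slots are less noisy, and the truncation of everything after the noise-group are all hypothetical.

```python
import random

MASK = -1  # placeholder mask-token id (hypothetical)


def sample_multitf_example(tokens, block_size, max_group, rng):
    """Build one toy training example: clean prefix + bounded group of noisy blocks.

    Returns (noisy_tokens, loss_positions, prefix_blocks, ratios).
    """
    num_blocks = len(tokens) // block_size
    assert num_blocks >= 1

    # Clean prefix length in blocks; at least one block is left for the group.
    prefix_blocks = rng.randrange(0, num_blocks)
    group_blocks = min(max_group, num_blocks - prefix_blocks)

    # One mask ratio per slot. Sorting is an assumption meant to mimic a
    # running-set whose earlier blocks are closer to being finished.
    ratios = sorted(rng.random() for _ in range(group_blocks))

    # Keep the prefix and the noise-group only; later blocks are dropped.
    end = (prefix_blocks + group_blocks) * block_size
    noisy = list(tokens[:end])
    loss_positions = []
    for slot, ratio in enumerate(ratios):
        start = (prefix_blocks + slot) * block_size
        for pos in range(start, start + block_size):
            if rng.random() < ratio:
                noisy[pos] = MASK
                loss_positions.append(pos)  # loss would be taken on masked tokens
    return noisy, loss_positions, prefix_blocks, ratios


rng = random.Random(0)
tokens = list(range(100, 164))  # 64 dummy token ids, i.e. 8 blocks of 8
noisy, loss_positions, prefix_blocks, ratios = sample_multitf_example(
    tokens, block_size=8, max_group=3, rng=rng
)
print("clean prefix blocks:", prefix_blocks)
print("per-slot mask ratios:", [round(r, 2) for r in ratios])
print("masked positions:", loss_positions)
```

With `max_group=1` this reduces to a teacher-forcing-style example (one noisy block after a clean prefix), while a large `max_group` with no bound approaches the multi-noisy-block setting the abstract attributes to diffusion forcing.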


Get this paper in your agent:

hf papers read 2606.29215

Don’t have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 8

SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-16B • 16B • Updated about 2 hours ago • 19

SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-DMax-16B • 16B • Updated about 2 hours ago • 21

SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-DMax-16B • 16B • Updated about 2 hours ago • 10

SJTU-DENG-Lab/MBD-Code-LLaDA2-mini-16B • 16B • Updated about 2 hours ago • 110

Datasets citing this paper: 1

SJTU-DENG-Lab/MBD-LMs-MultiTF-Datasets • Preview • Updated about 2 hours ago • 42

Spaces citing this paper: 0

Collections including this paper: 1

Similar Articles

Dynamic Chunking for Diffusion Language Models

arXiv cs.CL

This paper introduces Dynamic Chunking for Diffusion Language Models (DCDM), which replaces fixed positional blocks in block discrete diffusion with content-defined semantic chunks using a differentiable Chunking Attention mechanism, achieving consistent improvements across scales up to 1.5B parameters.

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Hugging Face Daily Papers

PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.