VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

arXiv cs.CL 06/17/26, 04:00 AM Papers

masked-diffusion language-models padding-tokens eos instruction-tuning text-generation voidpadding

Summary

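VoidPadding introduces a [VOID] token to handle padding in masked diffusion language models, leaving [EOS] to signal semantic termination only. On Dream-7B-Instruct, it improves the block-size-averaged four-task mean on mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average.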

arXiv:2606.17999v1. Abstract: MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.
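
The difference between the two padding conventions described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the token strings as plain Python strings, the choice of exactly one [EOS] followed by [VOID] filler, and the truncation behavior for over-long responses are all assumptions made for illustration.

```python
EOS = "[EOS]"
VOID = "[VOID]"


def pad_eos_only(response, canvas_len):
    """Inherited convention: [EOS] is used both to terminate and to pad."""
    if len(response) >= canvas_len:
        # Assumption: over-long responses are simply truncated.
        return response[:canvas_len]
    return response + [EOS] * (canvas_len - len(response))


def pad_void(response, canvas_len):
    """Decoupled convention (assumed layout): one [EOS], then [VOID] filler."""
    if len(response) >= canvas_len:
        # Assumption: over-long responses are simply truncated.
        return response[:canvas_len]
    padded = response + [EOS]
    return padded + [VOID] * (canvas_len - len(padded))


response = ["The", "answer", "is", "4"]
eos_target = pad_eos_only(response, 8)
void_target = pad_void(response, 8)
```

Under the first convention the training target contains four [EOS] tokens for this example, so [EOS] is seen mostly as filler. Under the second, [EOS] appears once, at the point where the response actually ends.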

Original Article

Cached at: 06/17/26, 05:42 AM

# VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
Source: [https://arxiv.org/abs/2606.17999](https://arxiv.org/abs/2606.17999)
[View PDF](https://arxiv.org/pdf/2606.17999)

> Abstract: MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at [this https URL](https://github.com/Haru-LCY/VoidPadding).
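
The abstract says that at inference time [EOS] enables early stopping and [VOID] guides canvas expansion, but it does not give the decision rules. The sketch below is one hypothetical way such signals could be read off a partially denoised canvas. The function name `next_action`, the `[MASK]` token string, and both rules (stop once everything before the first [EOS] is decoded; expand when the last canvas position is committed to ordinary content) are assumptions for illustration, not the paper's algorithm.

```python
EOS = "[EOS]"
VOID = "[VOID]"
MASK = "[MASK]"  # assumed name for a still-undecoded position


def next_action(canvas):
    """Return (action, response) for a partially denoised canvas.

    action is "stop", "expand", or "continue"; response is the decoded
    prefix when stopping and None otherwise. Both rules are hypothetical.
    """
    # Early stopping: everything before the first [EOS] is already decoded,
    # so any positions after it need no further denoising steps.
    if EOS in canvas:
        end = canvas.index(EOS)
        if MASK not in canvas[:end]:
            return ("stop", canvas[:end])
    # Expansion: the final position holds ordinary content rather than
    # [VOID] or [EOS], which suggests the response does not fit.
    if canvas and canvas[-1] not in (MASK, VOID, EOS):
        return ("expand", None)
    return ("continue", None)


stopped = next_action(["a", "b", EOS, MASK, MASK])
crowded = next_action(["a", "b", "c", "d"])
pending = next_action(["a", MASK, EOS, VOID])
```

In this toy setup the first canvas stops with two positions still masked, which is the kind of saving that would show up as fewer function evaluations. The second canvas has no termination signal and no [VOID] tail, so it requests more room.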

## Submission history

From: Chunyu Liu [[view email](https://arxiv.org/show-email/63c8aad2/2606.17999)] **[v1]** Tue, 16 Jun 2026 14:46:53 UTC (2,532 KB)

Similar Articles

Supportive Token Revealing for Fast Diffusion Language Model Decoding

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models
