VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
Summary
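VoidPadding introduces a [VOID] token to handle padding in masked diffusion language models (MDLMs), so that [EOS] serves only as a semantic terminator. On Dream-7B-Instruct, the abstract reports a gain of +17.84 points over the original model in the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks, with decoding NFE reduced by 55.7% on average.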
# VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
Source: [https://arxiv.org/abs/2606.17999](https://arxiv.org/abs/2606.17999)
[View PDF](https://arxiv.org/pdf/2606.17999)
> Abstract: MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated `[EOS]` tokens for padding during instruction tuning, giving `[EOS]` a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of `[EOS]` overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces `[VOID]` for padding and reserves `[EOS]` for termination. During inference, the learned `[EOS]` signal enables early stopping, while the learned `[VOID]` signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at [https://github.com/Haru-LCY/VoidPadding](https://github.com/Haru-LCY/VoidPadding).
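
To make the dual-role problem concrete, here is a minimal Python sketch contrasting the two padding conventions on a fixed-length response canvas. It is an illustration, not the authors' implementation: the function names (`pad_eos_only`, `pad_with_void`, `truncate_at_eos`), the string tokens, and the layout of exactly one `[EOS]` followed by `[VOID]` fill are all assumptions, since the abstract does not specify them.

```python
from typing import List

# Token strings are stand-ins; a real tokenizer would use integer ids.
EOS = "[EOS]"
VOID = "[VOID]"


def _check_fits(response: List[str], canvas_len: int) -> None:
    """Make sure the response plus one terminator fits on the canvas."""
    if len(response) + 1 > canvas_len:
        raise ValueError("response plus terminator does not fit the canvas")


def pad_eos_only(response: List[str], canvas_len: int) -> List[str]:
    """Conventional scheme: [EOS] both terminates and fills the canvas."""
    _check_fits(response, canvas_len)
    return response + [EOS] * (canvas_len - len(response))


def pad_with_void(response: List[str], canvas_len: int) -> List[str]:
    """Assumed decoupled scheme: one [EOS] terminator, then [VOID] padding."""
    _check_fits(response, canvas_len)
    return response + [EOS] + [VOID] * (canvas_len - len(response) - 1)


def truncate_at_eos(canvas: List[str]) -> List[str]:
    """Early-stopping view: keep only the tokens before the first [EOS]."""
    if EOS in canvas:
        return canvas[: canvas.index(EOS)]
    return list(canvas)


response = ["2", "+", "2", "=", "4"]
print(pad_eos_only(response, 8))
print(pad_with_void(response, 8))
```

Under the conventional scheme, most `[EOS]` targets in a short response are padding rather than termination. In the assumed decoupled layout, `[EOS]` appears once per canvas, so its training signal marks only the end of the answer.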
## Submission history
From: Chunyu Liu ([view email](https://arxiv.org/show-email/63c8aad2/2606.17999))
**[v1]** Tue, 16 Jun 2026 14:46:53 UTC (2,532 KB)
## Similar Articles
Supportive Token Revealing for Fast Diffusion Language Model Decoding
This paper proposes AXON, a training-free module that improves the quality-latency trade-off of discrete diffusion language model decoding by intelligently selecting 'anchor' tokens to reveal first, using attention, uncertainty, and confidence signals to support subsequent denoising steps. Experiments on reasoning and code-generation benchmarks show AXON reduces function evaluations while maintaining or improving accuracy.
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
This paper introduces Scratchpad Patching, a technique for tokenizer-free language models that decouples compute from patch size by dynamically refreshing context within patches to reduce patch lag.
DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Introduces DLLM-JEPA, a JEPA formulation for masked diffusion language models that constructs two views from a single input via the diffusion noise schedule, reducing training FLOPs by 33% relative to LLM-JEPA and improving fine-tuning performance on tasks like GSM8K.
MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models
Researchers propose a training-free method called Suffix-Anchored Confidence Modulation to improve confidence-based decoding in diffusion language models by addressing issues with EOT tokens and premature decoding.