iLLaDA is an 8B-parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.
This paper investigates the effectiveness of top-1 collapse rate as a stability monitor for short-horizon LoRA fine-tuning of discrete diffusion language models, finding it has zero precision, and proposes max gradient norm as a more reliable alternative with higher precision and F1 score on LLaDA-family models.
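To make "zero precision" concrete: a monitor has zero precision when none of the alarms it raises coincide with an actual failure. The sketch below scores a binary monitor with the standard precision, recall, and F1 definitions. The run labels and the two monitors are hypothetical illustration data, not results from the paper, and `precision_recall_f1` is a helper name introduced here.

```python
def precision_recall_f1(alarms, failures):
    """Score a binary stability monitor against ground-truth failure labels.

    alarms[i]   -- True if the monitor flagged run i as unstable
    failures[i] -- True if run i actually failed
    """
    if len(alarms) != len(failures):
        raise ValueError("alarms and failures must have the same length")

    tp = sum(1 for a, f in zip(alarms, failures) if a and f)
    fp = sum(1 for a, f in zip(alarms, failures) if a and not f)
    fn = sum(1 for a, f in zip(alarms, failures) if not a and f)

    # Convention used here: an undefined ratio (zero denominator) scores 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical outcomes for five fine-tuning runs (illustration only).
failures = [False, False, True, True, False]

# Monitor A fires only on runs that did not fail: every alarm is a false positive.
monitor_a = [True, True, False, False, False]

# Monitor B fires on both failing runs plus one healthy run.
monitor_b = [False, True, True, True, False]

print("monitor A:", precision_recall_f1(monitor_a, failures))
print("monitor B:", precision_recall_f1(monitor_b, failures))
```

In this toy data, monitor A is the zero-precision case: it raises alarms, but never on a run that fails.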
A systematic experimental analysis of eight state-of-the-art diffusion language models across multiple benchmarks, examining the trade-offs between generation quality and computational efficiency.
Proposes Self-Generated T2T, a training method that aligns token editing training with inference by using the model's own predictions as error sources, improving accuracy on LLaDA2.1.
PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.
This paper theoretically analyzes diffusion language models through a bias-variance lens, identifying trade-offs between masking and uniform diffusion kernels. It proposes SemDLM+, which adds a global transition and semantic-frequency penalty to overcome the semantic basin problem, achieving competitive generation quality on LM1B and OpenWebText benchmarks.
This paper proposes three training-time interventions (positional weighting, first-error focal loss, and chain loss) to align diffusion-based draft models with autoregressive verification in speculative decoding, improving accepted prefix length by 21–76% without extra inference cost.
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
This paper proposes BiCache, a novel KV caching technique for shared prefixes in diffusion language models, which avoids accuracy collapse by dynamically reusing cached keys and values in shallow layers and achieves 36.3%–98.3% throughput improvement.
This paper proposes Dynamic Infilling Anchors (DIA), a training-free method for diffusion large language models that dynamically estimates end-anchor positions to enforce format constraints (e.g., parseable JSON, reasoning templates) while avoiding the rigidity of fixed-span approaches. Experiments show significant zero-shot gains on GSM8K and MATH benchmarks.
This paper introduces CAPR (Cached-Amortized Path Refinement), a reinforcement learning algorithm for diffusion large language models that extracts tree-like supervision signals from the denoising trace without the compute cost of full tree rollouts. CAPR achieves state-of-the-art performance on reasoning benchmarks like GSM8K, Math500, Sudoku, and Countdown at roughly 0.75x the cost of flat rollouts.
This paper proposes AXON, a training-free module that improves the quality-latency trade-off of discrete diffusion language model decoding by intelligently selecting 'anchor' tokens to reveal first, using attention, uncertainty, and confidence signals to support subsequent denoising steps. Experiments on reasoning and code-generation benchmarks show AXON reduces function evaluations while maintaining or improving accuracy.
This paper presents EPIC, an efficient framework for context-free grammar constrained decoding in diffusion language models that reduces inference time by up to 67.5% while maintaining syntactic correctness.
Introduces DLLM-JEPA, a JEPA formulation for masked diffusion language models that constructs two views from a single input via the diffusion noise schedule, reducing training FLOPs by 33% relative to LLM-JEPA and improving fine-tuning performance on tasks like GSM8K.
dMoE proposes block-level expert routing for diffusion LLMs, reducing the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% performance and achieving 76–80% memory reduction with 1.14–1.66× speedup.
GDSD proposes a reinforcement learning method that directly distills denoisers from advantage-guided self-teachers for diffusion language models, avoiding biases from ELBO-based likelihood surrogates. It achieves up to +19.6% accuracy improvements on planning, math, and coding benchmarks over prior state-of-the-art methods.
Researchers propose a training-free method called Suffix-Anchored Confidence Modulation to improve confidence-based decoding in diffusion language models by addressing issues with EOT tokens and premature decoding.
dlmserve is the first open-source serving engine for diffusion language models, providing an OpenAI-compatible API, continuous batching, and 2.5x throughput over Hugging Face, all within 12GB VRAM.
This paper introduces TraceLock, a lightweight plug-in controller that learns a token-commitment policy for frozen diffusion language models, improving the quality-step tradeoff across various tasks without retraining.
This paper introduces infilling extraction, a new method for extracting training data from diffusion language models by using arbitrary binary masks, showing that such models are more vulnerable to memorization attacks than previously thought.
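As a rough illustration of what "arbitrary binary masks" means for an extraction query, the sketch below hides an arbitrary subset of positions in a token sequence and measures how many hidden tokens a model's guess recovers. It assumes a plain list-of-strings token representation and a `[MASK]` placeholder; the helper names (`apply_binary_mask`, `prefix_mask`, `recovery_rate`) are hypothetical and not taken from the paper, and no actual model is called.

```python
MASK = "[MASK]"  # assumed placeholder; real models use their own mask token


def apply_binary_mask(tokens, mask):
    """Replace tokens[i] with MASK wherever mask[i] is True."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    return [MASK if hidden else tok for tok, hidden in zip(tokens, mask)]


def prefix_mask(length, visible_prefix):
    """Prefix-style query as a special case: reveal the first k tokens only."""
    return [i >= visible_prefix for i in range(length)]


def recovery_rate(original, predicted, mask):
    """Fraction of masked positions where the prediction matches the original."""
    masked = [i for i, hidden in enumerate(mask) if hidden]
    if not masked:
        return 0.0
    hits = sum(1 for i in masked if predicted[i] == original[i])
    return hits / len(masked)


# Hypothetical training sequence (illustration only).
tokens = ["the", "secret", "key", "is", "1234"]

# An arbitrary mask can hide any subset of positions, not just a suffix.
arbitrary = [False, True, False, False, True]
print(apply_binary_mask(tokens, arbitrary))

# A prefix-only query expressed with the same machinery.
print(apply_binary_mask(tokens, prefix_mask(len(tokens), 2)))

# Stand-in for a model's infilled output; a real attack would query the model.
guess = ["the", "secret", "key", "is", "0000"]
print(recovery_rate(tokens, guess, arbitrary))
```

The point of the sketch is only that a prefix prompt is one mask among many, so a binary-mask formulation covers a strictly larger set of queries.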