Tag
This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.
The author built Joey, a 170M parameter masked diffusion language model from scratch, trained on FineWeb-Edu and fine-tuned on DailyDialog, achieving fluent but incoherent sentences due to capacity limitations. The project highlights the differences from autoregressive LLMs and the lessons learned from building and debugging the system.
Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.
Introduces DLLM-JEPA, a JEPA formulation for masked diffusion language models that constructs two views from a single input via the diffusion noise schedule, reducing training FLOPs by 33% relative to LLM-JEPA and improving fine-tuning performance on tasks like GSM8K.
This paper identifies a failure mode in masked diffusion language models where confidence-based decoding leads to high-confidence errors on complex reasoning tasks, and shows that confidence-aligned training exacerbates this issue while random masking preserves reasoning performance.
This paper proposes using Masked Diffusion Language Models (MDLMs) as text-based world models for agentic reinforcement learning, showing that their any-order denoising objective avoids prefix mode collapse and leads to stronger performance than autoregressive baselines.
AnchorDiff proposes a topology-aware masked diffusion framework for radiology report generation, integrating RadGraph-derived clinical anchors and confidence-based rewriting to achieve state-of-the-art results on MIMIC-CXR and MIMIC-RG4 benchmarks.
Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.
Introduces Token-to-Mask (T2M) remasking to fix generation errors in masked diffusion LMs by resetting suspect tokens to mask state instead of overwriting, yielding up to +5.92 accuracy on CMATH without extra training or parameters.
CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.