masked-diffusion

#masked-diffusion

Trace-Based On-Policy Distillation for Masked Diffusion Language Models

arXiv cs.CL · 3h ago

Proposes Trace-Based On-Policy Distillation (TOPD), a teacher-supervised framework that transfers reasoning abilities to masked diffusion language models without reward estimation, reaching accuracy comparable to RL-trained counterparts with a significant compute speedup.

Reinforcing the Generation Order of Multimodal Masked Diffusion Models

arXiv cs.LG · 2026-07-10

This paper introduces a learnable control module trained via Group Relative Policy Optimization (GRPO) to optimize the generation order in multimodal masked diffusion models, achieving improvements in text-to-image alignment and multimodal understanding.

@volokuleshov: New blog post: How to Build a Diffusion Language Model. Diffusion LLMs went from open problem to reality in 2 years (Me…

X AI KOLs Timeline · 2026-07-08

A blog post from Volodymyr Kuleshov's Cornell group explains how to build diffusion language models. It covers core techniques such as masked diffusion, iterative refinement, variable-length generation, controllable generation, fast samplers, and RL post-training, using models such as Mercury, Gemma Diffusion, and Nemotron Diffusion as examples.

Masked Diffusion Decoding as $x$-Prediction Flow

arXiv cs.CL · 2026-06-30

This paper reinterprets masked diffusion language model decoding as continuous clean-state prediction, introducing a flow-based framework where tokens are updated continuously and asynchronously based on confidence, achieving 97% of LLaDA's performance with 25% of the decoding budget.

Masked Language Flow Models

arXiv cs.CL · 2026-06-29

This paper introduces Masked Language Flow Models (MLFMs), which incorporate masking into flow-based language models to enable continuous flow for conditional generation and to allow pretrained Masked Diffusion Models to be converted into MLFMs. The authors propose a sampler that alternates continuous denoising with discrete unmasking, and report the first demonstration that flow-based language models can scale to downstream reasoning and instruction-following tasks.

Improved Large Language Diffusion Models

arXiv cs.CL · 2026-06-25

iLLaDA is an 8B-parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

arXiv cs.CL · 2026-06-17

VoidPadding introduces a [VOID] token to handle padding in masked diffusion language models, allowing [EOS] to focus solely on semantic termination. This method significantly improves performance on reasoning and coding benchmarks while reducing decoding steps.
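
To make the separation of roles concrete, here is a minimal sketch of what such a padding scheme could look like. The layout (one [EOS] directly after the content, then [VOID] up to a fixed length) and the helper name `pad_with_void` are assumptions for illustration; the summary does not specify the paper's exact procedure.

```python
EOS = "[EOS]"
VOID = "[VOID]"


def pad_with_void(tokens, length):
    """Pad a token list to a fixed length (hypothetical helper).

    Assumed layout: content, a single [EOS] marking semantic termination,
    then [VOID] tokens filling the remaining slots.
    """
    needed = len(tokens) + 1  # content plus one [EOS]
    if needed > length:
        raise ValueError("sequence plus [EOS] does not fit in the target length")
    return tokens + [EOS] + [VOID] * (length - needed)


padded = pad_with_void(["def", "f", "(", ")"], 8)
```

Under this assumed layout, [EOS] appears exactly once per sequence, so it only ever signals where the content ends, while filling the fixed-length canvas is left to [VOID].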

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

Hugging Face Daily Papers · 2026-06-15

This paper proposes TIE, a knowledge fusion framework for masked diffusion language models that tracks confidence dynamics to identify reliable decoding trajectories and iteratively transfers partially denoised sequences between models, improving generation quality on reasoning tasks.

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

arXiv cs.CL · 2026-06-10

This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.
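
The idea of discounting by attention can be illustrated with a toy reranking rule. This is only a sketch under stated assumptions, not the ADAS formula: the scoring function, the use of `1 - confidence` as uncertainty, and the names `discounted_scores` and `pick_positions` are all hypothetical.

```python
def discounted_scores(confidences, attention, masked):
    """Score each masked position (hypothetical rule, not the paper's).

    A position's confidence is scaled down by the attention mass it places
    on other still-masked positions, weighted by their uncertainty.
    """
    scores = {}
    for i in masked:
        penalty = sum(
            attention[i][j] * (1.0 - confidences[j]) for j in masked if j != i
        )
        scores[i] = confidences[i] * (1.0 - penalty)
    return scores


def pick_positions(confidences, attention, masked, k):
    """Return the k masked positions with the highest discounted score."""
    scores = discounted_scores(confidences, attention, masked)
    return sorted(masked, key=lambda i: scores[i], reverse=True)[:k]


confidences = [0.9, 0.8, 0.4]
attention = [
    [0.1, 0.1, 0.8],  # position 0 attends mostly to the uncertain position 2
    [0.8, 0.2, 0.0],  # position 1 attends mostly to the confident position 0
    [0.3, 0.3, 0.4],
]
masked = [0, 1, 2]

by_confidence = max(masked, key=lambda i: confidences[i])
by_discount = pick_positions(confidences, attention, masked, k=1)
```

In this toy input, plain confidence ranking would unmask position 0 first, while the discounted rule prefers position 1, because position 0 leans heavily on a position that is still uncertain.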

I built a diffusion language model from scratch. It writes flawless sentences that mean nothing, and that is the interesting part.

Reddit r/AI_Agents · 2026-06-08

The author built Joey, a 170M-parameter masked diffusion language model, from scratch, training it on FineWeb-Edu and fine-tuning it on DailyDialog. The model produces fluent but incoherent sentences, which the author attributes to limited capacity. The post highlights how the approach differs from autoregressive LLMs and the lessons learned from building and debugging the system.

Adaptive Order Policies for Masked Diffusion

arXiv cs.LG · 2026-06-02

Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

arXiv cs.CL · 2026-06-02

Introduces DLLM-JEPA, a JEPA formulation for masked diffusion language models that constructs two views from a single input via the diffusion noise schedule, reducing training FLOPs by 33% relative to LLM-JEPA and improving fine-tuning performance on tasks like GSM8K.

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

arXiv cs.AI · 2026-05-29

This paper identifies a failure mode in masked diffusion language models where confidence-based decoding leads to high-confidence errors on complex reasoning tasks, and shows that confidence-aligned training exacerbates this issue while random masking preserves reasoning performance.

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Reddit r/MachineLearning · 2026-05-21

This paper proposes using Masked Diffusion Language Models (MDLMs) as text-based world models for agentic reinforcement learning, showing that their any-order denoising objective avoids prefix mode collapse and leads to stronger performance than autoregressive baselines.

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

arXiv cs.AI · 2026-05-19

AnchorDiff proposes a topology-aware masked diffusion framework for radiology report generation, integrating RadGraph-derived clinical anchors and confidence-based rewriting to achieve state-of-the-art results on MIMIC-CXR and MIMIC-RG4 benchmarks.

Discrete Stochastic Localization for Non-autoregressive Generation

arXiv cs.LG · 2026-05-14

Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.

Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

arXiv cs.CL · 2026-04-22

Introduces Token-to-Mask (T2M) remasking to fix generation errors in masked diffusion LMs by resetting suspect tokens to mask state instead of overwriting, yielding up to +5.92 accuracy on CMATH without extra training or parameters.
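
The difference between remasking and overwriting can be sketched in a few lines. This is an illustrative sketch only: the summary does not say how T2M decides which tokens are suspect, so the fixed confidence threshold and the helper name `remask_low_confidence` are assumptions.

```python
MASK = "[MASK]"


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Reset decoded tokens below a confidence threshold to MASK.

    Hypothetical criterion: the threshold stands in for whatever rule
    flags a token as suspect. Flagged tokens are not replaced with a new
    guess; they return to the mask state for a later denoising step.
    """
    return [
        MASK if tok != MASK and conf < threshold else tok
        for tok, conf in zip(tokens, confidences)
    ]


tokens = ["2", "+", "2", "=", "5"]
confidences = [0.99, 0.98, 0.97, 0.95, 0.30]
remasked = remask_low_confidence(tokens, confidences)
```

Here only the last token falls below the threshold, so it alone is returned to the mask state and can be re-predicted with the rest of the sequence as context.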

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

arXiv cs.CL · 2026-04-20

CRoCoDiL is a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling than discrete methods such as LLaDA.
