masked-diffusion

#masked-diffusion

PreDiff-LM: Pretrained Discrete Masked Diffusion Language Modeling with Hybrid Attention

arXiv cs.AI ↗ · 4d ago Cached

PreDiff-LM proposes a hybrid attention mechanism that keeps causal attention for prompt tokens and uses bidirectional attention for masked target tokens. This lets pretrained autoregressive models be adapted for discrete masked diffusion language modeling, with improvements in perplexity and downstream tasks over prior diffusion baselines.
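As a rough illustration of the hybrid attention idea in this summary, the sketch below builds a boolean attention mask in plain Python. It is not the paper's implementation: the function name `hybrid_attention_mask`, the prompt-then-target layout, and the choice to let target positions attend to the whole prompt are all assumptions made for illustration.

```python
def hybrid_attention_mask(prompt_len, target_len):
    """Build mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): the first `prompt_len` positions are
    prompt tokens, followed by `target_len` masked target tokens.
    """
    total = prompt_len + target_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < prompt_len:
                # Prompt rows stay causal: attend only to earlier or same positions.
                mask[i][j] = j <= i
            else:
                # Target rows are bidirectional: attend to the full prompt
                # and to every target position (an assumption of this sketch).
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for a 2-token prompt and 2 target tokens.
    for row in hybrid_attention_mask(2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Keeping the prompt rows causal means the prompt is processed the same way the autoregressive model saw it during pretraining, which is presumably why such a mask suits adaptation of a pretrained model.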


CaRE Compute-aware Remasking Evaluation Protocol for Masked Diffusion Language Models

arXiv cs.AI ↗ · 4d ago Cached

This paper introduces CaRE, a compute-aware evaluation protocol for masked diffusion language models that standardizes step counts, metrics, and stochasticity. It demonstrates that previous comparisons conflate algorithmic improvements with evaluation artifacts, showing temperature explains most MAUVE variance and compute-matched comparisons reverse published rankings.


MotifRole-Diff: Risk-Optimal Role-Aware Corruption for Masked Molecular Graph Diffusion

arXiv cs.LG ↗ · 6d ago Cached

MotifRole-Diff proposes a role-aware corruption schedule for masked discrete diffusion on molecular graphs, allocating masking rates based on denoising difficulty and graph-level perturbation impact, demonstrating improved validity and reduced FCD on QM9 and MOSES benchmarks.


Trace-Based On-Policy Distillation for Masked Diffusion Language Models

arXiv cs.CL ↗ · 2026-07-21 Cached

This paper proposes Trace-Based On-Policy Distillation (TOPD), a teacher-supervised framework that transfers reasoning abilities to masked diffusion language models without reward estimation. It reports accuracy comparable to RL-trained counterparts with a significant compute speedup.


Reinforcing the Generation Order of Multimodal Masked Diffusion Models

arXiv cs.LG ↗ · 2026-07-10 Cached

This paper introduces a learnable control module trained via Group Relative Policy Optimization (GRPO) to optimize the generation order in multimodal masked diffusion models, achieving improvements in text-to-image alignment and multimodal understanding.


@volokuleshov: New blog post: How to Build a Diffusion Language Model. Diffusion LLMs went from open problem to reality in 2 years (Me…

X AI KOLs Timeline ↗ · 2026-07-08 Cached

A comprehensive blog post by Volodymyr Kuleshov's Cornell group explains how to build diffusion language models, covering core techniques like masked diffusion, iterative refinement, variable-length generation, controllable generation, fast samplers, and RL post-training, using open-source models such as Mercury, Gemma Diffusion, and Nemotron Diffusion as examples.


Masked Diffusion Decoding as $x$-Prediction Flow

arXiv cs.CL ↗ · 2026-06-30 Cached

This paper reinterprets masked diffusion language model decoding as continuous clean-state prediction, introducing a flow-based framework where tokens are updated continuously and asynchronously based on confidence, achieving 97% of LLaDA's performance with 25% of the decoding budget.


Masked Language Flow Models

arXiv cs.CL ↗ · 2026-06-29 Cached

This paper introduces Masked Language Flow Models (MLFMs), which incorporate masking into flow-based language models to enable continuous flow for conditional generation and allow pretrained Masked Diffusion Models to be converted. The authors propose a novel sampler that alternates continuous denoising with discrete unmasking, demonstrating for the first time that flow-based language models can scale to downstream reasoning and instruction-following tasks.


Improved Large Language Diffusion Models

arXiv cs.CL ↗ · 2026-06-25 Cached

iLLaDA is an 8B parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.


VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

arXiv cs.CL ↗ · 2026-06-17 Cached

VoidPadding introduces a [VOID] token to handle padding in masked diffusion language models, allowing [EOS] to focus solely on semantic termination. This method significantly improves performance on reasoning and coding benchmarks while reducing decoding steps.


Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

This paper proposes TIE, a knowledge fusion framework for masked diffusion language models that tracks confidence dynamics to identify reliable decoding trajectories and iteratively transfers partially denoised sequences between models, improving generation quality on reasoning tasks.


Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.


I built a diffusion language model from scratch. It writes flawless sentences that mean nothing, and that is the interesting part.

Reddit r/AI_Agents ↗ · 2026-06-08

The author built Joey, a 170M-parameter masked diffusion language model, from scratch, training it on FineWeb-Edu and fine-tuning it on DailyDialog. The model produces fluent but incoherent sentences, which the author attributes to limited capacity. The post highlights the differences from autoregressive LLMs and the lessons learned from building and debugging the system.


Adaptive Order Policies for Masked Diffusion

arXiv cs.LG ↗ · 2026-06-02 Cached

Proposes learning the unmasking order in masked diffusion models using a lightweight policy network, with a weighted loss that outperforms heuristics on combinatorial tasks and protein design.


DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

arXiv cs.CL ↗ · 2026-06-02 Cached

Introduces DLLM-JEPA, a JEPA formulation for masked diffusion language models that constructs two views from a single input via the diffusion noise schedule, reducing training FLOPs by 33% relative to LLM-JEPA and improving fine-tuning performance on tasks like GSM8K.


The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

arXiv cs.AI ↗ · 2026-05-29 Cached

This paper identifies a failure mode in masked diffusion language models where confidence-based decoding leads to high-confidence errors on complex reasoning tasks, and shows that confidence-aligned training exacerbates this issue while random masking preserves reasoning performance.


Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Reddit r/MachineLearning ↗ · 2026-05-21

This paper proposes using Masked Diffusion Language Models (MDLMs) as text-based world models for agentic reinforcement learning, showing that their any-order denoising objective avoids prefix mode collapse and leads to stronger performance than autoregressive baselines.


AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

arXiv cs.AI ↗ · 2026-05-19 Cached

AnchorDiff proposes a topology-aware masked diffusion framework for radiology report generation, integrating RadGraph-derived clinical anchors and confidence-based rewriting to achieve state-of-the-art results on MIMIC-CXR and MIMIC-RG4 benchmarks.


Discrete Stochastic Localization for Non-autoregressive Generation

arXiv cs.LG ↗ · 2026-05-14 Cached

Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.


Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

arXiv cs.CL ↗ · 2026-04-22 Cached

Introduces Token-to-Mask (T2M) remasking to fix generation errors in masked diffusion LMs by resetting suspect tokens to mask state instead of overwriting, yielding up to +5.92 accuracy on CMATH without extra training or parameters.
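The remask-instead-of-overwrite idea in this summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MASK_ID` value, the function name `remask_suspect_tokens`, and the use of a simple confidence threshold to decide which tokens are "suspect" are hypothetical and not taken from the paper.

```python
MASK_ID = -1  # hypothetical id for the mask token


def remask_suspect_tokens(tokens, confidences, threshold):
    """Reset low-confidence tokens to the mask state.

    A token is treated as suspect when its confidence falls below
    `threshold` (an assumed criterion). Suspect tokens are set back to
    MASK_ID so a later denoising step can predict them again, rather
    than being overwritten directly with a new token.
    """
    refined = []
    for token, confidence in zip(tokens, confidences):
        if token != MASK_ID and confidence < threshold:
            refined.append(MASK_ID)  # hand the position back to the denoiser
        else:
            refined.append(token)  # keep confident or already-masked positions
    return refined


if __name__ == "__main__":
    # The middle token has low confidence, so it is reset to the mask state.
    print(remask_suspect_tokens([5, 7, 9], [0.9, 0.2, 0.8], threshold=0.5))
```

Because the step only edits the token sequence between denoising passes, a sketch like this needs no extra training or parameters, which matches the summary's description of the method.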
