LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Hugging Face Daily Papers 04/22/26, 12:00 AM Papers

Summary

LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
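
To make the idea of block-level masked diffusion more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not specify the unmasking schedule, so the confidence-based top-k rule, the `MASK` id, the `toy_predictor` stand-in for the backbone, and all parameter names are assumptions chosen for illustration. The sketch only shows the control flow: the output is split into blocks, each block starts fully masked, and several positions are filled in parallel at every step before decoding moves to the next block.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def toy_predictor(tokens, positions, rng):
    """Stand-in for the dLLM backbone (assumption).

    Returns a (token, confidence) guess for each requested masked position.
    A real model would condition on `tokens`; this toy version ignores it.
    """
    return {i: (rng.randrange(100), rng.random()) for i in positions}


def block_masked_decode(prompt, gen_len, block_size, steps_per_block, seed=0):
    """Fill `gen_len` masked tokens block by block, several tokens per step."""
    rng = random.Random(seed)
    tokens = list(prompt) + [MASK] * gen_len
    calls = 0  # number of predictor (backbone) calls made

    for block_start in range(len(prompt), len(tokens), block_size):
        block_end = min(block_start + block_size, len(tokens))
        block_len = block_end - block_start
        # Tokens to unmask per step, rounded up so the block always finishes.
        per_step = -(-block_len // steps_per_block)

        while True:
            masked = [i for i in range(block_start, block_end) if tokens[i] == MASK]
            if not masked:
                break
            preds = toy_predictor(tokens, masked, rng)
            calls += 1
            # Keep the most confident predictions; the rest stay masked.
            ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in ranked[:per_step]:
                tokens[i] = preds[i][0]

    return tokens, calls


if __name__ == "__main__":
    out, calls = block_masked_decode([1, 2, 3], gen_len=8, block_size=4,
                                     steps_per_block=2)
    print(out, calls)
```

In this sketch, 8 tokens are produced with 4 predictor calls instead of 8, which is the parallel-decoding benefit in miniature. Completed blocks are never revisited, so a real system could in principle reuse computation for that fixed prefix; the abstract does not detail how its prefix-aware optimizations actually work.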

Cached at: 04/23/26, 03:35 AM

Source: https://huggingface.co/papers/2604.20796 Published on Apr 22

#1 Paper of the day

Models citing this paper: 1

inclusionAI/LLaDA2.0-Uni • Image-Text-to-Text • 16B • Updated 17 minutes ago • 8

Similar Articles

Improved Large Language Diffusion Models

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

TextLDM: Language Modeling with Continuous Latent Diffusion

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
