LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Hugging Face Daily Papers

Summary

LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision tokens within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
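To make the block-level masked diffusion described above concrete, here is a minimal PyTorch sketch of the decoding loop. It is an illustration only, not the actual LLaDA2.0-Uni implementation: the identifiers (`MASK_ID`, `BLOCK_SIZE`, `NUM_STEPS`, the `model` call signature) and the simple confidence-based unmasking rule are all assumptions. The idea it shows is that each block is appended fully masked and then revealed in parallel over a few denoising steps, while the committed prefix conditions the denoiser, which is what enables the prefix-aware optimizations the abstract mentions.

```python
import torch

# Hypothetical constants; the real tokenizer/model define their own values.
MASK_ID = 0       # id of the [MASK] token (assumed)
BLOCK_SIZE = 32   # tokens generated per block (assumed)
NUM_STEPS = 8     # parallel denoising steps per block (assumed)

@torch.no_grad()
def block_diffusion_generate(model, prompt_ids: torch.Tensor,
                             num_blocks: int) -> torch.Tensor:
    """Decode blocks left-to-right; inside a block, tokens are revealed in
    parallel, most-confident first. Assumes batch size 1 and a model that
    maps token ids (1, T) to logits (1, T, vocab)."""
    seq = prompt_ids.clone()
    for _ in range(num_blocks):
        # Append a fully masked block. Everything before `start` is frozen,
        # so its activations can be cached across steps (prefix-aware reuse).
        block = torch.full((1, BLOCK_SIZE), MASK_ID,
                           dtype=seq.dtype, device=seq.device)
        seq = torch.cat([seq, block], dim=1)
        start = seq.shape[1] - BLOCK_SIZE
        for step in range(1, NUM_STEPS + 1):
            logits = model(seq)
            conf, pred = logits[:, start:, :].softmax(-1).max(-1)
            still_masked = seq[:, start:] == MASK_ID
            # Reveal a growing fraction of the block at each step.
            target = (BLOCK_SIZE * step) // NUM_STEPS
            k = target - int((~still_masked).sum())
            if k > 0:
                conf = conf.masked_fill(~still_masked, float("-inf"))
                idx = conf.topk(k, dim=-1).indices
                seq[:, start:].scatter_(1, idx, pred.gather(1, idx))
    return seq
```

In the real system, a learned remasking schedule and the MoE backbone would replace this toy confidence rule, and for image generation the resulting discrete visual tokens would be handed to the few-step-distilled diffusion decoder to render pixels.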

Source: https://huggingface.co/papers/2604.20796 (published Apr 22)

#1 Paper of the day


Models citing this paper: inclusionAI/LLaDA2.0-Uni (Image-Text-to-Text, 16B)


Similar Articles

TextLDM: Language Modeling with Continuous Latent Diffusion

Hugging Face Daily Papers

This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

arXiv cs.CL

CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.

Continuous Latent Diffusion Language Model

Hugging Face Daily Papers

Cola DLM is a hierarchical latent diffusion language model that uses text-to-latent mapping and conditional decoding to achieve efficient, non-autoregressive text generation.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.