LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Summary
LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.
Cached at: 04/23/26, 03:35 AM
Source: https://huggingface.co/papers/2604.20796
Published on Apr 22
#1 Paper of the day
Models citing this paper: 1
inclusionAI/LLaDA2.0-Uni (Image-Text-to-Text, 16B)
Similar Articles
Improved Large Language Diffusion Models
iLLaDA is an 8B parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.
UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer
UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and LLM for semantic encoding, achieving strong performance on both tasks.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
TextLDM: Language Modeling with Continuous Latent Diffusion
This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.
CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving higher generation quality and 10x faster sampling than discrete methods such as LLaDA.