LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Summary
LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.
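The paper does not spell out its decoding procedure on this page, but LLaDA-family models generate text by iterative unmasking rather than left-to-right sampling. The toy sketch below illustrates that general idea only; `toy_model`, the vocabulary, and the confidence-based unmasking schedule are all hypothetical stand-ins, not the paper's actual method.

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["a", "cat", "sat", "here"]  # toy vocabulary (illustrative only)

def toy_model(seq):
    """Hypothetical denoiser: guess a (token, confidence) pair per position.

    A real diffusion LLM would run a transformer over the partially
    masked sequence; here we just return random guesses for masks.
    """
    rng = random.Random(0)
    return [(rng.choice(VOCAB), rng.random()) if t == MASK else (t, 1.0)
            for t in seq]

def diffusion_decode(length=8, steps=4):
    # Start from a fully masked sequence and reveal tokens over `steps` rounds.
    seq = [MASK] * length
    for step in range(steps):
        preds = toy_model(seq)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Unmask a proportional share of the remaining masks each round,
        # keeping the model's highest-confidence predictions first.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
    return seq

print(diffusion_decode())
```

After `steps` rounds every position is filled, so the whole sequence is produced in a fixed number of parallel refinement passes instead of one forward pass per token — the property that makes diffusion decoding attractive relative to autoregressive generation.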
Source: https://huggingface.co/papers/2604.20796 Published on Apr 22
#1 Paper of the day
Abstract
Models citing this paper: 1
#### inclusionAI/LLaDA2.0-Uni — Image-Text-to-Text • 16B • Updated 17 minutes ago • 8
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 1
Similar Articles
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
TextLDM: Language Modeling with Continuous Latent Diffusion
This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.
CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.
Continuous Latent Diffusion Language Model
Cola DLM is a hierarchical latent diffusion language model that uses text-to-latent mapping and conditional decoding to achieve efficient, non-autoregressive text generation.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.