UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer
Summary
UniDDT is a decoupled diffusion transformer framework that unifies multimodal understanding and generation. It uses a Noisy ViT encoder and an LLM for semantic encoding, and reports a GenEval score of 0.87 for generation and an MME score of 1699.5 for understanding.
Cached at: 06/16/26, 11:34 AM
Source: https://huggingface.co/papers/2606.16255
Abstract
UniDDT addresses key challenges in unified multimodal models by using a Noisy ViT encoder and an LLM for semantic encoding, with a separate diffusion decoder to balance visual understanding and generation tasks.
Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) separate visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data, which neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT can use the latent space as a unified visual representation, making understanding and generation tasks compatible. This balances the scalability needed for generation against the semantic expressiveness needed for understanding. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. For multimodal understanding tasks, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
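The abstract does not say how the dual data structures are laid out, so the following is only a minimal sketch of the general idea: each image-text pair yields one understanding sample (image in, text out) and one generation sample (text in, image out). The `Sample` class, its field names, the task labels, and the example file name are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; the layout is an assumption, not the paper's format."""
    task: str       # "understanding" or "generation"
    condition: str  # what the model is given
    target: str     # what the model must produce


def make_dual_samples(image_id: str, caption: str) -> Tuple[Sample, Sample]:
    """Build one understanding and one generation sample from the same pair."""
    # Understanding direction: the image is the input, the text is the target.
    understanding = Sample(task="understanding", condition=image_id, target=caption)
    # Generation direction: the same text is the input, the same image is the target.
    generation = Sample(task="generation", condition=caption, target=image_id)
    return understanding, generation


def build_dual_dataset(pairs: List[Tuple[str, str]]) -> List[Sample]:
    """Flatten every image-text pair into its two dual samples."""
    dataset: List[Sample] = []
    for image_id, caption in pairs:
        dataset.extend(make_dual_samples(image_id, caption))
    return dataset


# Illustrative pair only; not data from the paper.
pairs = [("img_0001.png", "a red bicycle leaning against a wall")]
dataset = build_dual_dataset(pairs)
for sample in dataset:
    print(sample.task, "|", sample.condition, "->", sample.target)
```

In this sketch the two samples share the same underlying pair, so the target of one direction is the condition of the other. That is one simple way to read the "interdependence" the abstract describes; the actual construction in the paper may differ.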
Similar Articles
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.
MMDiff: Extending Diffusion Transformers for Multi-Modal Generation
MMDiff extends frozen diffusion transformers into multi-modal generative systems using lightweight decoders, achieving significant improvements in semantic segmentation and other perceptual tasks through multi-timestep feature fusion.
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
This article covers the UniVidX paper, which introduces a unified multimodal framework for video generation using diffusion priors, and examines its cross-modal coherence mechanisms.
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
This paper proposes Decoupled Residual Denoising Diffusion Models (DRDD) for unified and data-efficient image-to-image translation, decoupling noise diffusion for domain harmonization from residual diffusion for semantic mapping.
TextLDM: Language Modeling with Continuous Latent Diffusion
This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.