UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Hugging Face Daily Papers

Summary

UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and LLM for semantic encoding, achieving strong performance on both tasks.

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) separate visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT can use the latent space as a unified visual representation that is compatible with both understanding and generation. This balances scalability in generation against semantic expressiveness in understanding. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. On visual generation, UniDDT achieves a GenEval score of 0.87 and an overall DPG score of 86.9. On multimodal understanding, UniDDT scores 1699.5 on the MME benchmark and 76.5 overall on SEED-Bench.
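The abstract does not say how the dual data structures are built, so the following is only a minimal sketch of one plausible reading: a single image-text pair yields an understanding sample (image in, caption out) and a generation sample (caption in, image out). The `Sample` class, its field names, the `make_dual_samples` helper, and the fixed instruction string are all hypothetical and are not taken from the paper. The sketch also leaves out how a noised image would be fed to the Noisy ViT encoder during generation training.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Sample:
    """One training example. All field names are hypothetical."""
    task: Literal["understanding", "generation"]
    prompt: str                   # text given to the LLM
    input_image: Optional[str]    # clean image the model conditions on, if any
    target_text: Optional[str]    # supervised through text decoding
    target_image: Optional[str]   # supervised through diffusion decoding


def make_dual_samples(image_path: str, caption: str) -> tuple[Sample, Sample]:
    """Derive an understanding sample and a generation sample from one pair."""
    understanding = Sample(
        task="understanding",
        prompt="Describe this image.",  # assumed instruction, not from the paper
        input_image=image_path,
        target_text=caption,
        target_image=None,
    )
    generation = Sample(
        task="generation",
        prompt=caption,                 # the caption becomes the generation prompt
        input_image=None,
        target_text=None,
        target_image=image_path,
    )
    return understanding, generation


und, gen = make_dual_samples(
    "images/0001.jpg", "A red bicycle leaning against a brick wall."
)
```

In this reading the two samples mirror each other: the text target of the understanding sample is the prompt of the generation sample, and the image input of one is the image target of the other. That is one way the interdependence between the two data types could be expressed.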
Cached at: 06/16/26, 11:34 AM

Source: https://huggingface.co/papers/2606.16255

