MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Hugging Face Daily Papers 06/15/26, 12:00 AM Papers

Summary

MMDiff extends frozen diffusion transformers into multi-modal generative systems using lightweight decoders, achieving significant improvements in semantic segmentation and other perceptual tasks through multi-timestep feature fusion.

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
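The abstract does not spell out how the multi-timestep fusion is computed, so the following is only a minimal sketch of one plausible reading: features taken at several denoising timesteps are combined with per-pixel weights that are normalized across timesteps. The function name `fuse_timestep_features`, the softmax normalization, and the array shapes are all assumptions made for illustration and are not taken from the paper; random arrays stand in for real backbone features and learned weights.

```python
import numpy as np


def fuse_timestep_features(features, weight_logits):
    """Fuse per-timestep feature maps with spatially varying weights.

    features:      (T, H, W, C) feature maps from T denoising timesteps
    weight_logits: (T, H, W) unnormalized scores, one per timestep and pixel
                   (hypothetical; a real system would presumably learn these)
    returns:       (H, W, C) fused feature map
    """
    # Softmax over the timestep axis, computed independently at every pixel.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Broadcast the weights over channels and sum across timesteps.
    return (weights[..., None] * features).sum(axis=0)


rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
features = rng.normal(size=(T, H, W, C))   # stand-in for frozen backbone features
weight_logits = rng.normal(size=(T, H, W))  # stand-in for learned aggregation scores
fused = fuse_timestep_features(features, weight_logits)
print(fused.shape)  # (8, 8, 16)
```

With all-zero logits this reduces to a plain average over timesteps, and with one dominant logit at a pixel it reduces to single-timestep extraction at that pixel, which is the baseline the abstract compares against.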

Original Article

Cached at: 06/16/26, 11:32 AM

Source: https://huggingface.co/papers/2606.16673

Abstract

MMDiff transforms frozen diffusion transformers into multi-modal generative systems that produce images and perceptual modalities using lightweight decoders, achieving improved semantic segmentation through multi-timestep feature fusion and spatial aggregation.

Similar Articles

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Semantic DLM+: Improving Diffusion Language Models through Bias-variance Trade-off in Transition Kernel Design

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Dynamic Chunking for Diffusion Language Models
