MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Hugging Face Daily Papers

Summary

MMDiff extends frozen diffusion transformers into multi-modal generative systems using lightweight decoders, achieving significant improvements in semantic segmentation and other perceptual tasks through multi-timestep feature fusion.

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
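The fusion idea described above can be illustrated with a small sketch. The abstract does not say how the spatially varying aggregation weights are produced, so this example assumes one unnormalised score per timestep and pixel, normalised with a softmax across timesteps. The names `fuse_timesteps` and `weight_logits` are hypothetical and are not taken from the paper. Plain Python lists are used to keep the sketch dependency-free; a practical version would use an array library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_timesteps(features, weight_logits):
    """Fuse per-timestep feature maps using per-pixel weights.

    features:      [T][H][W][C] nested lists, one feature map per timestep
    weight_logits: [T][H][W] nested lists, one unnormalised score per
                   timestep and pixel (assumption: in practice these would
                   come from a small learned module)
    returns:       [H][W][C] fused feature map
    """
    T = len(features)
    H = len(features[0])
    W = len(features[0][0])
    C = len(features[0][0][0])
    fused = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Normalise across timesteps independently at each pixel,
            # so different locations can favour different timesteps.
            w = softmax([weight_logits[t][y][x] for t in range(T)])
            for t in range(T):
                for c in range(C):
                    fused[y][x][c] += w[t] * features[t][y][x][c]
    return fused

# Toy input: 2 timesteps, a 1x2 grid, 1 channel.
features = [
    [[[1.0], [1.0]]],   # timestep 0
    [[[3.0], [3.0]]],   # timestep 1
]
weight_logits = [
    [[0.0, 10.0]],      # timestep 0
    [[0.0, -10.0]],     # timestep 1
]
fused = fuse_timesteps(features, weight_logits)
```

In this toy input the left pixel weights both timesteps equally and lands on their average, while the right pixel draws almost entirely on timestep 0. A single global weight per timestep could not express that per-location preference.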

Cached at: 06/16/26, 11:32 AM


Source: https://huggingface.co/papers/2606.16673


