MMDiff: Extending Diffusion Transformers for Multi-Modal Generation
Summary
MMDiff extends frozen diffusion transformers into multi-modal generative systems using lightweight decoders. Multi-timestep feature fusion improves semantic segmentation by up to 28.7% mIoU over single-timestep extraction, with strong results on salient object detection and depth estimation as well.
Cached at: 06/16/26, 11:32 AM
Source: https://huggingface.co/papers/2606.16673
Abstract
MMDiff transforms frozen diffusion transformers into multi-modal generative systems that produce images and perceptual modalities using lightweight decoders, achieving improved semantic segmentation through multi-timestep feature fusion and spatial aggregation.
Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
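The abstract names multi-timestep feature fusion with spatially varying aggregation weights but does not spell out the mechanism. The sketch below is one plausible reading, not the authors' implementation: features captured at T denoising timesteps are combined per pixel using a softmax over timesteps. The function name `fuse_timestep_features`, the `(T, H, W, C)` layout, and the use of a softmax are all assumptions made for illustration; in a real system the weight logits would presumably come from a small learned head rather than random numbers.

```python
import numpy as np


def fuse_timestep_features(features, weight_logits):
    """Fuse per-timestep feature maps with per-pixel weights (illustrative only).

    features:      array of shape (T, H, W, C), one feature map per timestep.
    weight_logits: array of shape (T, H, W), unnormalized per-pixel scores.
    Returns (fused, weights) with shapes (H, W, C) and (T, H, W).
    """
    if features.shape[:3] != weight_logits.shape:
        raise ValueError("weight_logits must match the (T, H, W) dims of features")

    # Softmax over the timestep axis, computed independently at every pixel.
    # Subtracting the per-pixel max keeps the exponentials numerically stable.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Broadcast weights over the channel axis and sum over timesteps.
    fused = (weights[..., None] * features).sum(axis=0)
    return fused, weights


# Hypothetical sizes: 4 timesteps, an 8x8 feature grid, 16 channels.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
features = rng.standard_normal((T, H, W, C))
logits = rng.standard_normal((T, H, W))

fused, weights = fuse_timestep_features(features, logits)
```

Because the weights are normalized separately at each spatial location, different regions of the image can draw on different timesteps. Setting all logits to zero reduces the scheme to a plain average over timesteps, which is a convenient baseline when checking whether spatially varying weights help.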
Similar Articles
UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer
UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and LLM for semantic encoding, achieving strong performance on both tasks.
Semantic DLM+: Improving Diffusion Language Models through Bias-variance Trade-off in Transition Kernel Design
This paper theoretically analyzes diffusion language models through a bias-variance lens, identifying trade-offs between masking and uniform diffusion kernels. It proposes SemDLM+, which adds a global transition and semantic-frequency penalty to overcome the semantic basin problem, achieving competitive generation quality on LM1B and OpenWebText benchmarks.
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
This paper introduces Live Music Diffusion Models (LMDMs), which modify the diffusion process to enable efficient block-wise processing and novel training paradigms for real-time interactive music generation on consumer hardware, outperforming discrete autoregressive models in inference complexity and enabling stable post-training alignment.
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
This paper proposes Decoupled Residual Denoising Diffusion Models (DRDD) for unified and data-efficient image-to-image translation, decoupling noise diffusion for domain harmonization from residual diffusion for semantic mapping.
Dynamic Chunking for Diffusion Language Models
This paper introduces Dynamic Chunking for Diffusion Language Models (DCDM), which replaces fixed positional blocks in block discrete diffusion with content-defined semantic chunks using a differentiable Chunking Attention mechanism, achieving consistent improvements across scales up to 1.5B parameters.