Semantic Generative Tuning for Unified Multimodal Models

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

multimodal generative-tuning segmentation unified-models vision-language ai-research

Summary

Introduces Semantic Generative Tuning (SGT), a paradigm that uses image segmentation as a generative proxy to align visual understanding and generation in unified multimodal models, improving both multimodal comprehension and generative fidelity.

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
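The abstract's claim about feature linear separability can be made concrete with a linear probe, a common way to measure how well a linear classifier separates classes on frozen features. The sketch below is only an illustration under stated assumptions: the abstract does not describe the paper's actual probing protocol, the features here are synthetic Gaussian clusters rather than UMM activations, and the names `linear_probe_accuracy` and `synthetic_split` are hypothetical helpers introduced for this example.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear classifier on frozen features; return test accuracy."""

    def add_bias(x):
        # Append a constant column so the probe can learn a bias term.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    xtr, xte = add_bias(train_x), add_bias(test_x)
    targets = np.eye(num_classes)[train_y]  # one-hot regression targets
    # Closed-form ridge regression: (X^T X + ridge * I) W = X^T Y
    gram = xtr.T @ xtr + ridge * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ targets)
    preds = (xte @ weights).argmax(axis=1)
    return float((preds == test_y).mean())


def synthetic_split(rng, separation, n_per_class=200, dim=16, num_classes=4):
    """Hypothetical stand-in for model features: Gaussian clusters around class centers."""
    centers = rng.normal(size=(num_classes, dim)) * separation
    labels = np.repeat(np.arange(num_classes), n_per_class)

    def sample():
        # Unit-variance noise around each sample's class center.
        return centers[labels] + rng.normal(size=(labels.size, dim))

    return sample(), labels, sample(), labels


rng = np.random.default_rng(0)
# Small separation mimics entangled features; large separation mimics separable ones.
acc_entangled = linear_probe_accuracy(*synthetic_split(rng, 0.1), num_classes=4)
acc_separable = linear_probe_accuracy(*synthetic_split(rng, 5.0), num_classes=4)
print(f"entangled features: {acc_entangled:.3f}")
print(f"separable features: {acc_separable:.3f}")
```

In practice the synthetic arrays would be replaced by features extracted from the same layer of a model before and after tuning, with the probe refit on each; a higher held-out probe accuracy would then indicate more linearly separable features.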

Original Article

Cached at: 05/20/26, 06:37 AM

Source: https://huggingface.co/papers/2605.18714

Similar Articles

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Boosting Visual Instruction Tuning with Self-Supervised Guidance
