Semantic Generative Tuning for Unified Multimodal Models
Summary
Introduces Semantic Generative Tuning (SGT), a paradigm that uses image segmentation as a generative proxy to align visual understanding and generation in unified multimodal models, improving both comprehension and fidelity.
View Cached Full Text
Cached at: 05/20/26, 06:37 AM
Paper page - Semantic Generative Tuning for Unified Multimodal Models
Source: https://huggingface.co/papers/2605.18714
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement.
This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity.
Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity.
Similar Articles
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 is a unified multimodal model that achieves state-of-the-art performance by processing visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit proposes using intelligent image editing as a single general task to simultaneously improve unified multimodal models' understanding, generation, and editing capabilities, with an automated data synthesis pipeline creating complex editing instructions.
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.