Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Summary
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
Cached at: 05/11/26, 07:20 AM
Source: https://huggingface.co/papers/2605.05781
Abstract
An understanding-oriented post-training framework enhances generative models by using comprehension tasks as supervisory signals for improved image generation and editing.
Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
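The abstract names two understanding objectives, captioning and visual regression, but does not give their exact form. As a rough illustration only, the sketch below assumes the captioning term is a token-level cross-entropy, the regression term is a mean squared error on visual features, and the two are combined as a weighted sum. The function names, the weights, and the choice of losses are all hypothetical and are not taken from the paper; how this signal is combined with any generation objective is not specified in the abstract.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one token position (assumed captioning-style loss)."""
    max_logit = max(logits)
    # Numerically stable log-sum-exp over the vocabulary logits.
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted, target):
    """Mean squared error between predicted and reference visual features
    (assumed form of the visual regression loss)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def understanding_loss(caption_logits, caption_targets,
                       predicted_features, target_features,
                       caption_weight=1.0, regression_weight=1.0):
    """Hypothetical weighted sum of the two understanding objectives.

    caption_logits: one list of vocabulary logits per caption token.
    caption_targets: the reference token index for each position.
    predicted_features / target_features: flat lists of visual feature values.
    """
    caption_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(caption_logits, caption_targets)
    ) / len(caption_targets)
    regression_loss = mean_squared_error(predicted_features, target_features)
    return caption_weight * caption_loss + regression_weight * regression_loss


if __name__ == "__main__":
    loss = understanding_loss(
        caption_logits=[[0.0, 0.0]],
        caption_targets=[0],
        predicted_features=[1.0, 2.0],
        target_features=[1.0, 4.0],
        regression_weight=0.5,
    )
    print(round(loss, 4))
```

In a real training setup this scalar would be computed with an automatic-differentiation framework, so that its gradients reach the representations shared with the generation pathway; the pure-Python version here only shows how the two terms might be combined.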
Similar Articles
Semantic Generative Tuning for Unified Multimodal Models
Introduces Semantic Generative Tuning (SGT), a paradigm that uses image segmentation as a generative proxy to align visual understanding and generation in unified multimodal models, improving both comprehension and fidelity.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath proposes a framework for adaptive coordination of understanding and generation in unified multimodal models, leveraging coordination-path diversity to improve performance over fixed strategies.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit proposes using intelligent image editing as a single general task to simultaneously improve unified multimodal models' understanding, generation, and editing capabilities, with an automated data synthesis pipeline creating complex editing instructions.
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.