Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.

Original Article

Cached at: 05/11/26, 07:20 AM

Source: https://huggingface.co/papers/2605.05781

Abstract

Understanding-oriented post-training framework enhances generative models by using comprehension tasks as supervisory signals for improved image generation and editing.

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
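The abstract names two understanding objectives, captioning and visual regression, but does not state how they are combined with the generation objective. As a rough illustration only, the sketch below assumes a weighted sum of a generation loss, a captioning cross-entropy, and a feature-regression mean squared error. The function names, the weights `w_caption` and `w_regress`, and the choice of cross-entropy and MSE are all assumptions made for this example, not details taken from the paper.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one token position (stand-in for a captioning loss)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(pred, target):
    """MSE between predicted and reference features (stand-in for visual regression)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(gen_loss, caption_logits, caption_targets,
                  feat_pred, feat_target, w_caption=1.0, w_regress=1.0):
    """Hypothetical weighted sum of generation and understanding losses.

    Assumption: the understanding terms are computed from the same
    representations the generator uses, so minimizing the total also
    updates those representations. The paper may combine them differently.
    """
    caption_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(caption_logits, caption_targets)
    ) / len(caption_targets)
    regress_loss = mean_squared_error(feat_pred, feat_target)
    total = gen_loss + w_caption * caption_loss + w_regress * regress_loss
    return total, caption_loss, regress_loss


if __name__ == "__main__":
    total, cap, reg = combined_loss(
        gen_loss=0.5,
        caption_logits=[[0.0, 0.0]],   # one caption token, two-word vocabulary
        caption_targets=[0],
        feat_pred=[1.0, 2.0],
        feat_target=[1.0, 4.0],
        w_regress=0.5,
    )
    print(total, cap, reg)
```

In a real training setup the same idea would be written in an autodiff framework so that the understanding terms backpropagate into the shared representations; plain Python is used here only to keep the arithmetic visible.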

Get this paper in your agent:

hf papers read 2605.05781

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Semantic Generative Tuning for Unified Multimodal Models

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Boosting Visual Instruction Tuning with Self-Supervised Guidance
