Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Hugging Face Daily Papers 05/05/26, 12:00 AM Papers

Summary

The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

Original Article


Cached at: 05/08/26, 08:08 AM


Source: https://huggingface.co/papers/2605.04128

Abstract

JoyAI-Image integrates a spatially enhanced MLLM with MMDiT to achieve unified visual understanding, text-to-image generation, and instruction-guided image editing with enhanced spatial intelligence.

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.


Get this paper in your agent:

hf papers read 2605.04128

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### jdopensource/JoyAI-Image-Edit

Image-to-Image • Updated 1 day ago • 6.02k • 119



Similar Articles

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Semantic Generative Tuning for Unified Multimodal Models
