Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Hugging Face Daily Papers

Summary

The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

Cached at: 05/08/26, 08:08 AM

Source: https://huggingface.co/papers/2605.04128

Abstract

JoyAI-Image integrates a spatially enhanced MLLM with MMDiT to achieve unified visual understanding, text-to-image generation, and instruction-guided image editing with enhanced spatial intelligence.

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
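The abstract describes the architecture only at a high level, so the following is a toy data-flow sketch, not the paper's implementation. It illustrates one way an understanding module could hand a conditioning signal to a generation module through a shared interface. Every name here (`Condition`, `toy_mllm_encode`, `toy_mmdit_denoise`, `generate`) is hypothetical, and the "models" are simple arithmetic stand-ins for what are neural networks in the real system.

```python
import random
from dataclasses import dataclass


@dataclass
class Condition:
    """Hypothetical shared interface: a sequence of conditioning vectors."""
    tokens: list  # list of equal-length lists of floats


def toy_mllm_encode(instruction: str, dim: int = 4) -> Condition:
    """Stand-in for the MLLM: map each word to a deterministic vector."""
    words = instruction.split()
    if not words:
        raise ValueError("instruction must contain at least one word")
    tokens = []
    for word in words:
        rng = random.Random(word)  # seeding by the word keeps the toy reproducible
        tokens.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return Condition(tokens)


def toy_mmdit_denoise(latent, cond: Condition, steps: int = 10, rate: float = 0.3):
    """Stand-in for the MMDiT: move a noisy latent toward the mean condition vector."""
    dim = len(latent)
    target = [sum(t[i] for t in cond.tokens) / len(cond.tokens) for i in range(dim)]
    for _ in range(steps):
        # Each step removes a fixed fraction of the remaining gap to the target.
        latent = [x + rate * (t - x) for x, t in zip(latent, target)]
    return latent


def generate(instruction: str, dim: int = 4, seed: int = 0):
    """Understanding output conditions generation via the shared Condition object."""
    cond = toy_mllm_encode(instruction, dim)
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return toy_mmdit_denoise(noise, cond)


if __name__ == "__main__":
    print(generate("a red cube left of a blue sphere"))
```

The only point of the sketch is the hand-off: the generator never sees the raw instruction, only the `Condition` produced by the understanding side. How JoyAI-Image actually builds and consumes that interface is not specified in the abstract.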


Get this paper in your agent:

hf papers read 2605.04128

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### jdopensource/JoyAI-Image-Edit

Image-to-Image • Updated 1 day ago • 6.02k • 119

