Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
Summary
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
Cached at: 05/08/26, 08:08 AM
Source: https://huggingface.co/papers/2605.04128
Abstract
JoyAI-Image integrates a spatially enhanced MLLM with MMDiT to achieve unified visual understanding, text-to-image generation, and instruction-guided image editing with enhanced spatial intelligence.
We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
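To make the idea of a shared multimodal interface concrete, the toy sketch below shows one common way a diffusion transformer can mix conditioning tokens with image tokens: single-head joint attention over the concatenated sequence, with separate projection weights per modality. This is an illustration only and not JoyAI-Image's implementation. The function names (`joint_attention`, `init_params`), the dimensions, and the random stand-ins for MLLM features and image latents are all assumptions; the abstract does not specify how the two components are connected.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, rng):
    """Hypothetical per-modality Q/K/V projection matrices of shape (d, d)."""
    names = ["cond_q", "cond_k", "cond_v", "img_q", "img_k", "img_v"]
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in names}


def joint_attention(cond_tokens, image_tokens, params):
    """Single-head joint attention over conditioning and image tokens.

    cond_tokens:  (n_cond, d) stand-in for MLLM output features (assumption)
    image_tokens: (n_img, d)  stand-in for noisy image latents (assumption)
    """
    # Each modality is projected with its own weights, then the two
    # token streams are concatenated into one sequence.
    q = np.concatenate([cond_tokens @ params["cond_q"],
                        image_tokens @ params["img_q"]])
    k = np.concatenate([cond_tokens @ params["cond_k"],
                        image_tokens @ params["img_k"]])
    v = np.concatenate([cond_tokens @ params["cond_v"],
                        image_tokens @ params["img_v"]])

    # Every token attends to every token in both modalities.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v

    # Split the mixed sequence back into its two streams.
    n_cond = cond_tokens.shape[0]
    return mixed[:n_cond], mixed[n_cond:], weights


rng = np.random.default_rng(0)
d = 16
params = init_params(d, rng)
cond = rng.normal(size=(5, d))    # 5 conditioning tokens
img = rng.normal(size=(12, d))    # 12 image tokens
cond_out, img_out, weights = joint_attention(cond, img, params)
print(cond_out.shape, img_out.shape, weights.shape)
```

Because attention runs over the concatenated sequence, image tokens can read from the conditioning tokens and vice versa in a single layer, which is one plausible reading of perception and generation interacting through a shared interface.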
Get this paper in your agent:
hf papers read 2605.04128
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### jdopensource/JoyAI-Image-Edit • Image-to-Image • Updated 1 day ago • 6.02k • 119
Similar Articles
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
Semantic Generative Tuning for Unified Multimodal Models
Introduces Semantic Generative Tuning (SGT), a paradigm that uses image segmentation as a generative proxy to align visual understanding and generation in unified multimodal models, improving both comprehension and fidelity.