Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
Summary
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
Cached at: 05/13/26, 04:11 AM
Source: https://huggingface.co/papers/2605.12305
Abstract
INSET is a unified multimodal model that embeds images as native vocabulary within textual instructions, enabling better handling of complex interleaved inputs through transformer-based contextual locality and supporting both image generation and editing tasks.
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
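The contrast the abstract draws between separated and inline layouts can be illustrated with a toy sketch. This is not the authors' implementation: the `<img:NAME>` placeholder syntax, the string "patch tokens", and the helper names `interleave`, `separated`, and `gap` are all hypothetical, chosen only to show how placing an image's tokens at its semantic slot shortens the distance between a description and its visual target.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, value), where kind is "text" or "image"

# Hypothetical placeholder syntax for this sketch only.
PLACEHOLDER = re.compile(r"(<img:\w+>)")


def interleave(instruction: str, images: Dict[str, List[str]]) -> List[Token]:
    """Inline layout: swap each placeholder for that image's patch tokens."""
    sequence: List[Token] = []
    for piece in PLACEHOLDER.split(instruction):
        if not piece.strip():
            continue
        if PLACEHOLDER.fullmatch(piece):
            name = piece[5:-1]  # strip the leading "<img:" and trailing ">"
            sequence.extend(("image", patch) for patch in images[name])
        else:
            sequence.extend(("text", word) for word in piece.split())
    return sequence


def separated(instruction: str, images: Dict[str, List[str]]) -> List[Token]:
    """Separated layout: all image tokens first, then the text, with
    placeholders left in the text as plain reference words."""
    sequence: List[Token] = []
    for patches in images.values():
        sequence.extend(("image", patch) for patch in patches)
    for piece in PLACEHOLDER.split(instruction):
        if piece.strip():
            sequence.extend(("text", word) for word in piece.split())
    return sequence


def gap(sequence: List[Token], word: str, patch: str) -> int:
    """Distance in sequence positions between a text word and an image patch."""
    return abs(sequence.index(("text", word)) - sequence.index(("image", patch)))


instruction = "The dog <img:dog> sits beside the cat <img:cat> on a beach"
images = {"dog": ["dog_p0", "dog_p1"], "cat": ["cat_p0", "cat_p1"]}

inline_seq = interleave(instruction, images)
separated_seq = separated(instruction, images)

print("inline gap:   ", gap(inline_seq, "cat", "cat_p0"))
print("separated gap:", gap(separated_seq, "cat", "cat_p0"))
```

In the inline layout the word "cat" sits directly next to the cat image's tokens, while in the separated layout the distance grows with the number of images and the length of the instruction. A real model would use learned visual embeddings and attention rather than string tokens and index distances.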
Similar Articles
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
ILLUME-X is a unified multimodal model for free-form interleaved text-image generation, featuring improved data efficiency, stable training, and a comprehensive evaluation metric called ILScore. It outperforms previous models on tasks like style transfer, image decomposition, and storytelling.
InterleaveThinker: Reinforcing Agentic Interleaved Generation
InterleaveThinker introduces a multi-agent pipeline with planner and critic agents to enable interleaved text-image generation for existing image generators, achieving performance comparable to state-of-the-art models and improving reasoning benchmarks.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
IV-CoT decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation, using training-only sketch supervision to guide structural queries. It achieves state-of-the-art results on GenEval and T2I-CompBench.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.