Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
Summary
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
Source: https://huggingface.co/papers/2605.12305
Abstract
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
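The core idea of the abstract can be illustrated with a minimal sketch (not the paper's actual code; `embed_text`, the toy dimension, and the placeholder names are all assumptions): image feature vectors are spliced into the text embedding sequence at the exact position where the instruction refers to them, so an instruction like "put <img0> next to <img1>" becomes one interleaved sequence that a transformer can attend over locally.

```python
# Illustrative sketch of interleaved instructions: image features are
# inserted at their semantic slots in the token sequence, rather than
# concatenated before or after the text. Names and dimensions are toy
# assumptions, not the INSET implementation.

EMBED_DIM = 4  # toy embedding size; real models use hundreds or thousands

def embed_text(token: str) -> list[float]:
    # Stand-in for a learned text-embedding lookup.
    return [float(hash(token) % 97)] * EMBED_DIM

def interleave(tokens: list[str],
               image_feats: dict[str, list[list[float]]]) -> list[list[float]]:
    """Replace each image placeholder with that image's dense feature
    vectors, keeping the surrounding text order intact."""
    seq: list[list[float]] = []
    for tok in tokens:
        if tok in image_feats:
            seq.extend(image_feats[tok])  # the image contributes several "tokens"
        else:
            seq.append(embed_text(tok))
    return seq

tokens = ["put", "<img0>", "next", "to", "<img1>"]
feats = {"<img0>": [[0.1] * EMBED_DIM] * 2,   # 2 visual tokens per image (toy)
         "<img1>": [[0.2] * EMBED_DIM] * 2}
seq = interleave(tokens, feats)
print(len(seq))  # 3 text tokens + 2 * 2 visual tokens = 7
```

Because each image's features sit directly beside the words that describe it, matching a description to its visual target becomes a short-range dependency, which is the contextual-locality argument the abstract makes.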
Get this paper in your agent:
hf papers read 2605.12305
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
The paper introduces JoyAI-Image, a unified multimodal foundation model that integrates a spatially enhanced MLLM with MMDiT to achieve state-of-the-art performance in visual understanding, text-to-image generation, and instruction-guided editing.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.
InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
InterLV-Search is a new benchmark introduced in this paper to evaluate interleaved language-vision agentic search, highlighting limitations in current systems regarding visual evidence seeking and multimodal integration.
Hierarchical text-conditional image generation with CLIP latents
OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.