Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Hugging Face Daily Papers

Summary

This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
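
To make the central idea concrete, here is a minimal sketch (not the authors' implementation) of how image features can be spliced into an instruction at their semantic slots, so that each reference image sits next to the words that describe it. The placeholder convention (<img0>, <img1>), the whitespace tokenizer, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of "images as native vocabulary": image features are spliced
# into the instruction at their semantic slots instead of being concatenated
# before or after the text. Placeholder names, dimensions, and the toy
# tokenizer are illustrative assumptions, not the paper's actual setup.
import re
import torch
import torch.nn as nn

TEXT_DIM = 512      # hidden size of the language backbone (assumed)
VISION_DIM = 768    # output size of the vision encoder (assumed)
VOCAB_SIZE = 32000

token_embed = nn.Embedding(VOCAB_SIZE, TEXT_DIM)   # text token embeddings
visual_proj = nn.Linear(VISION_DIM, TEXT_DIM)      # maps image features into text space

def toy_token_ids(text: str) -> torch.Tensor:
    """Stand-in tokenizer: hash each whitespace token into the vocab."""
    return torch.tensor([hash(w) % VOCAB_SIZE for w in text.split()])

def embed_interleaved(instruction: str, image_feats: list[torch.Tensor]) -> torch.Tensor:
    """Build one embedding sequence where each <imgK> slot holds projected image tokens.

    instruction: e.g. "put the cat in <img0> on the sofa in <img1>"
    image_feats: one (num_patches, VISION_DIM) tensor per referenced image
    returns: (seq_len, TEXT_DIM) sequence ready for a causal transformer
    """
    pieces = []
    for chunk in re.split(r"(<img\d+>)", instruction):
        match = re.fullmatch(r"<img(\d+)>", chunk)
        if match:                          # image slot: insert dense visual tokens here
            pieces.append(visual_proj(image_feats[int(match.group(1))]))
        elif chunk.strip():                # ordinary text span
            pieces.append(token_embed(toy_token_ids(chunk)))
    return torch.cat(pieces, dim=0)

# Example: two reference images, each encoded into 16 patch features.
feats = [torch.randn(16, VISION_DIM) for _ in range(2)]
seq = embed_interleaved("put the cat in <img0> on the sofa in <img1>", feats)
print(seq.shape)  # (num_text_tokens + 2 * 16, TEXT_DIM)
```

Because the visual tokens land right where the instruction refers to them, the transformer can bind "the cat in <img0>" locally instead of matching a description to an image appended far away in the sequence.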

Paper page - Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Source: https://huggingface.co/papers/2605.12305

Abstract

INSET is a unified multimodal model that embeds images as native vocabulary within textual instructions, enabling better handling of complex interleaved inputs through transformer-based contextual locality and supporting both image generation and editing tasks.

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
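
The data engine is only described at a high level (VLM captioning plus LLM instruction writing over existing image and video data), but a schematic sketch of how such a pipeline might assemble one interleaved sample is shown below. Every function name, prompt, and the sample schema are assumptions made for illustration, not the paper's actual engine.

```python
# Schematic sketch of a data engine that turns ordinary image/video data into
# interleaved instruction-response samples. The VLM and LLM calls are stubbed;
# all names and the schema below are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class InterleavedSample:
    instruction: str                                            # text with <imgK> slots
    reference_images: list[str] = field(default_factory=list)   # paths of input images
    target_image: str = ""                                       # image the model should produce

def describe_with_vlm(image_path: str) -> str:
    """Stub for a VLM captioner (e.g. a dense caption of objects and attributes)."""
    return f"a detailed caption of {image_path}"

def write_instruction_with_llm(captions: list[str]) -> str:
    """Stub for an LLM that rewrites captions into one multi-image instruction,
    replacing each entity description with its image placeholder."""
    slots = " and ".join(f"the subject shown in <img{i}>" for i in range(len(captions)))
    return f"compose a scene that combines {slots} into a single coherent image"

def build_sample(frame_paths: list[str]) -> InterleavedSample:
    """Use earlier frames (or crops) as references and the last frame as the target,
    so the sequence supplies both the interleaved inputs and a ground-truth output."""
    refs, target = frame_paths[:-1], frame_paths[-1]
    captions = [describe_with_vlm(p) for p in refs]
    return InterleavedSample(
        instruction=write_instruction_with_llm(captions),
        reference_images=refs,
        target_image=target,
    )

sample = build_sample(["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"])
print(sample.instruction)
```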

Get this paper in your agent:

hf papers read 2605.12305

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Hugging Face Daily Papers

This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets and injecting only 3-10% visually grounded instructions into the training data, the method achieves consistent performance improvements across multiple benchmarks.
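
As a rough illustration of that reformulation, a classical pretext task such as rotation prediction could be cast as an image-instruction-response triplet and mixed into the tuning data at a small ratio. The task choice, wording, and schema below are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: recasting a self-supervised pretext task (rotation
# prediction) as an image-instruction-response triplet for instruction tuning.
# Only the 3-10% mixing ratio comes from the summary above; the rest is assumed.
import random

ROTATIONS = [0, 90, 180, 270]

def rotation_triplet(image_path: str) -> dict:
    angle = random.choice(ROTATIONS)
    return {
        "image": image_path,          # the (rotated) input image
        "rotation_applied": angle,    # bookkeeping for the data pipeline
        "instruction": "By how many degrees has this image been rotated? "
                       "Answer with 0, 90, 180, or 270.",
        "response": f"The image has been rotated by {angle} degrees.",
    }

def mix_into_training_set(vision_language_data: list[dict],
                          pretext_images: list[str],
                          ratio: float = 0.05) -> list[dict]:
    """Inject roughly `ratio` (e.g. 3-10%) pretext triplets into the ordinary
    instruction-tuning samples, then shuffle."""
    n_extra = int(len(vision_language_data) * ratio)
    extra = [rotation_triplet(p) for p in random.choices(pretext_images, k=n_extra)]
    mixed = vision_language_data + extra
    random.shuffle(mixed)
    return mixed

print(rotation_triplet("photo_001.jpg")["response"])
```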

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.
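
A schematic sketch of that two-stage flow (sometimes called unCLIP) is given below: a prior samples a CLIP image embedding from the caption, and a diffusion decoder turns each sampled embedding into pixels. The class and function names are placeholders, not OpenAI's actual API; only the data flow follows the description above.

```python
# Schematic two-stage pipeline: caption -> CLIP image embedding (prior),
# then embedding -> pixels (diffusion decoder). All objects passed in are
# placeholders; this shows the data flow, not a real library API.
import torch

def generate(caption, clip_text_encoder, prior, decoder, num_samples: int = 4):
    """Return `num_samples` images for one caption.

    clip_text_encoder: caption -> (text_dim,) CLIP text embedding
    prior:             object with .sample(text_emb) -> (img_dim,) image embedding
    decoder:           object with .sample(img_emb) -> (3, H, W) image tensor
    """
    text_emb = clip_text_encoder(caption)
    images = []
    for _ in range(num_samples):
        # Stage 1: sample a plausible CLIP image embedding for this caption.
        img_emb = prior.sample(text_emb)
        # Stage 2: decode the embedding into an image; varying img_emb (and the
        # decoder's own noise) is what yields diverse outputs for one caption.
        images.append(decoder.sample(img_emb))
    return torch.stack(images)

# Toy usage with random stand-ins, just to show shapes and data flow:
if __name__ == "__main__":
    class _Rand:
        def __init__(self, shape):
            self.shape = shape
        def sample(self, _):
            return torch.randn(*self.shape)

    out = generate("a corgi playing a trumpet",
                   clip_text_encoder=lambda c: torch.randn(512),
                   prior=_Rand((512,)),
                   decoder=_Rand((3, 64, 64)))
    print(out.shape)  # torch.Size([4, 3, 64, 64])
```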