Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Hugging Face Daily Papers

Summary

This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
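
To make the central idea concrete, here is a minimal sketch (not the authors' implementation) of how image features can be spliced into an instruction at their semantic slots, so that each reference image sits next to the words that describe it. The placeholder convention (<img0>, <img1>), the whitespace tokenizer, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of "images as native vocabulary": image features are spliced
# into the instruction at their semantic slots instead of being concatenated
# before or after the text. Placeholder names, dimensions, and the toy
# tokenizer are illustrative assumptions, not the paper's actual setup.
import re
import torch
import torch.nn as nn

TEXT_DIM = 512      # hidden size of the language backbone (assumed)
VISION_DIM = 768    # output size of the vision encoder (assumed)
VOCAB_SIZE = 32000

token_embed = nn.Embedding(VOCAB_SIZE, TEXT_DIM)   # text token embeddings
visual_proj = nn.Linear(VISION_DIM, TEXT_DIM)      # maps image features into text space

def toy_token_ids(text: str) -> torch.Tensor:
    """Stand-in tokenizer: hash each whitespace token into the vocab."""
    return torch.tensor([hash(w) % VOCAB_SIZE for w in text.split()])

def embed_interleaved(instruction: str, image_feats: list[torch.Tensor]) -> torch.Tensor:
    """Build one embedding sequence where each <imgK> slot holds projected image tokens.

    instruction: e.g. "put the cat in <img0> on the sofa in <img1>"
    image_feats: one (num_patches, VISION_DIM) tensor per referenced image
    returns: (seq_len, TEXT_DIM) sequence ready for a causal transformer
    """
    pieces = []
    for chunk in re.split(r"(<img\d+>)", instruction):
        match = re.fullmatch(r"<img(\d+)>", chunk)
        if match:                          # image slot: insert dense visual tokens here
            pieces.append(visual_proj(image_feats[int(match.group(1))]))
        elif chunk.strip():                # ordinary text span
            pieces.append(token_embed(toy_token_ids(chunk)))
    return torch.cat(pieces, dim=0)

# Example: two reference images, each encoded into 16 patch features.
feats = [torch.randn(16, VISION_DIM) for _ in range(2)]
seq = embed_interleaved("put the cat in <img0> on the sofa in <img1>", feats)
print(seq.shape)  # (num_text_tokens + 2 * 16, TEXT_DIM)
```

Because the visual tokens land right where the instruction refers to them, the transformer can bind "the cat in <img0>" locally instead of matching a description to an image appended far away in the sequence.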

Paper page - Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Source: https://huggingface.co/papers/2605.12305

Abstract

INSET is a unified multimodal model that embeds images as native vocabulary within textual instructions, enabling better handling of complex interleaved inputs through transformer-based contextual locality and supporting both image generation and editing tasks.

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
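
The data engine is only described at a high level (VLM captioning plus LLM instruction writing over existing image and video data), but a schematic sketch of how such a pipeline might assemble one interleaved sample is shown below. Every function name, prompt, and the sample schema are assumptions made for illustration, not the paper's actual engine.

```python
# Schematic sketch of a data engine that turns ordinary image/video data into
# interleaved instruction-response samples. The VLM and LLM calls are stubbed;
# all names and the schema below are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class InterleavedSample:
    instruction: str                                            # text with <imgK> slots
    reference_images: list[str] = field(default_factory=list)   # paths of input images
    target_image: str = ""                                       # image the model should produce

def describe_with_vlm(image_path: str) -> str:
    """Stub for a VLM captioner (e.g. a dense caption of objects and attributes)."""
    return f"a detailed caption of {image_path}"

def write_instruction_with_llm(captions: list[str]) -> str:
    """Stub for an LLM that rewrites captions into one multi-image instruction,
    replacing each entity description with its image placeholder."""
    slots = " and ".join(f"the subject shown in <img{i}>" for i in range(len(captions)))
    return f"compose a scene that combines {slots} into a single coherent image"

def build_sample(frame_paths: list[str]) -> InterleavedSample:
    """Use earlier frames (or crops) as references and the last frame as the target,
    so the sequence supplies both the interleaved inputs and a ground-truth output."""
    refs, target = frame_paths[:-1], frame_paths[-1]
    captions = [describe_with_vlm(p) for p in refs]
    return InterleavedSample(
        instruction=write_instruction_with_llm(captions),
        reference_images=refs,
        target_image=target,
    )

sample = build_sample(["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"])
print(sample.instruction)
```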

Get this paper in your agent:

hf papers read 2605.12305

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Hugging Face Daily Papers

This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets and injecting only 3-10% visually grounded instructions into the training data, the method achieves consistent performance improvements across multiple benchmarks.
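
As a rough illustration of that reformulation, a classical pretext task such as rotation prediction could be cast as an image-instruction-response triplet and mixed into the tuning data at a small ratio. The task choice, wording, and schema below are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: recasting a self-supervised pretext task (rotation
# prediction) as an image-instruction-response triplet for instruction tuning.
# Only the 3-10% mixing ratio comes from the summary above; the rest is assumed.
import random

ROTATIONS = [0, 90, 180, 270]

def rotation_triplet(image_path: str) -> dict:
    angle = random.choice(ROTATIONS)
    return {
        "image": image_path,          # the (rotated) input image
        "rotation_applied": angle,    # bookkeeping for the data pipeline
        "instruction": "By how many degrees has this image been rotated? "
                       "Answer with 0, 90, 180, or 270.",
        "response": f"The image has been rotated by {angle} degrees.",
    }

def mix_into_training_set(vision_language_data: list[dict],
                          pretext_images: list[str],
                          ratio: float = 0.05) -> list[dict]:
    """Inject roughly `ratio` (e.g. 3-10%) pretext triplets into the ordinary
    instruction-tuning samples, then shuffle."""
    n_extra = int(len(vision_language_data) * ratio)
    extra = [rotation_triplet(p) for p in random.choices(pretext_images, k=n_extra)]
    mixed = vision_language_data + extra
    random.shuffle(mixed)
    return mixed

print(rotation_triplet("photo_001.jpg")["response"])
```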

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.
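
A schematic sketch of that two-stage flow (sometimes called unCLIP) is given below: a prior samples a CLIP image embedding from the caption, and a diffusion decoder turns each sampled embedding into pixels. The class and function names are placeholders, not OpenAI's actual API; only the data flow follows the description above.

```python
# Schematic two-stage pipeline: caption -> CLIP image embedding (prior),
# then embedding -> pixels (diffusion decoder). All objects passed in are
# placeholders; this shows the data flow, not a real library API.
import torch

def generate(caption, clip_text_encoder, prior, decoder, num_samples: int = 4):
    """Return `num_samples` images for one caption.

    clip_text_encoder: caption -> (text_dim,) CLIP text embedding
    prior:             object with .sample(text_emb) -> (img_dim,) image embedding
    decoder:           object with .sample(img_emb) -> (3, H, W) image tensor
    """
    text_emb = clip_text_encoder(caption)
    images = []
    for _ in range(num_samples):
        # Stage 1: sample a plausible CLIP image embedding for this caption.
        img_emb = prior.sample(text_emb)
        # Stage 2: decode the embedding into an image; varying img_emb (and the
        # decoder's own noise) is what yields diverse outputs for one caption.
        images.append(decoder.sample(img_emb))
    return torch.stack(images)

# Toy usage with random stand-ins, just to show shapes and data flow:
if __name__ == "__main__":
    class _Rand:
        def __init__(self, shape):
            self.shape = shape
        def sample(self, _):
            return torch.randn(*self.shape)

    out = generate("a corgi playing a trumpet",
                   clip_text_encoder=lambda c: torch.randn(512),
                   prior=_Rand((512,)),
                   decoder=_Rand((3, 64, 64)))
    print(out.shape)  # torch.Size([4, 3, 64, 64])
```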