InterleaveThinker: Reinforcing Agentic Interleaved Generation
Summary
InterleaveThinker introduces a multi-agent pipeline with planner and critic agents to enable interleaved text-image generation for existing image generators, achieving performance comparable to state-of-the-art models and improving reasoning benchmarks.
Cached at: 06/12/26, 02:52 AM
Source: https://huggingface.co/papers/2606.13679
Abstract
InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks.
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator’s outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
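The planner, generator, and critic loop described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Step` type, the `run_interleaved` function, the `max_retries` cap, and the three toy agents are hypothetical names introduced purely for illustration, and the images are stand-in strings rather than real generator outputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One planned step (hypothetical structure, not from the paper)."""
    text: str         # text segment of the interleaved output
    instruction: str  # instruction handed to the image generator


# Hypothetical agent signatures, assumed for this sketch only.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]              # instruction -> image handle
Critic = Callable[[str, str], Optional[str]]  # None means "accepted"


def run_interleaved(
    prompt: str,
    planner: Planner,
    generator: Generator,
    critic: Critic,
    max_retries: int = 2,  # assumed cap on regenerations per step
) -> List[Tuple[str, str]]:
    """Plan the sequence, generate each image, and let the critic refine."""
    trajectory: List[Tuple[str, str]] = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            refined = critic(instruction, image)
            if refined is None:  # critic accepts the current image
                break
            instruction = refined  # critic rewrote the instruction
            image = generator(instruction)
        trajectory.append((step.text, image))
    return trajectory


# Toy stand-ins so the sketch runs without any model.
def toy_planner(prompt: str) -> List[Step]:
    return [
        Step(text=f"Step {i}: {prompt}", instruction=f"draw {prompt} part {i}")
        for i in (1, 2)
    ]


def toy_generator(instruction: str) -> str:
    return f"<image:{instruction}>"


def toy_critic(instruction: str, image: str) -> Optional[str]:
    # Accept only instructions that already ask for detail.
    return None if "detailed" in instruction else instruction + " detailed"


result = run_interleaved("a cat", toy_planner, toy_generator, toy_critic)
for text, image in result:
    print(text, image)
```

With the toy agents, each step is generated once, rejected once, and regenerated from the refined instruction, so a two-step plan costs four generator calls. The same per-step retry structure suggests how a longer trajectory could accumulate many generator calls, which is the cost the abstract cites as motivation for single-step RL.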
Get this paper in your agent:
hf papers read 2606.13679
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 3
#### InterleaveThinker/InterleaveThinker-Planner-8B Image-Text-to-Text • 770k • Updated 37 minutes ago
#### InterleaveThinker/InterleaveThinker-Critic-8B Image-Text-to-Text • 9B • Updated 37 minutes ago
#### InterleaveThinker/Critic-SFT-8B Image-Text-to-Text • 770k • Updated 37 minutes ago
Similar Articles
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
This paper introduces InterRS, a method for real-time speech generation that interleaves reasoning steps during natural pauses in speech, achieving better performance on math and logic benchmarks while maintaining fluent and instant responses.
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
The paper introduces TTE-Flash, a method that replaces explicit chain-of-thought reasoning with latent think tokens to generate reasoning-aware multimodal representations at constant inference cost, outperforming explicit CoT baselines on the MMEB-v2 benchmark.
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.
ETCHR: Editing To Clarify and Harness Reasoning
ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.