InterleaveThinker: Reinforcing Agentic Interleaved Generation

Hugging Face Daily Papers 06/11/26, 12:00 AM Papers

interleaved-generation multi-agent image-generation planner-agent critic-agent reinforcement-learning grpo

Summary

InterleaveThinker introduces a multi-agent pipeline with planner and critic agents to enable interleaved text-image generation for existing image generators, achieving performance comparable to state-of-the-art models and improving reasoning benchmarks.

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
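The planner and critic loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `Verdict`, `run_interleaved`, `max_retries`, and the three toy agents are hypothetical, the "images" are plain strings, and the real planner, critic, and generator are models rather than string functions. The sketch only shows the control flow: plan the steps, generate an image per step, let the critic check it against the instruction, and regenerate with a refined instruction when it deviates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    text: str         # text segment of the interleaved output for this step
    instruction: str  # instruction handed to the image generator


@dataclass
class Verdict:
    ok: bool                       # True if the image follows the instruction
    refined: Optional[str] = None  # corrected instruction when ok is False


def run_interleaved(
    prompt: str,
    planner: Callable[[str], List[Step]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    max_retries: int = 2,  # assumed cap on critic-driven regenerations per step
) -> List[Tuple[str, str]]:
    """Return a list of (text, image) pairs, one per planned step."""
    outputs = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            verdict = critic(instruction, image)
            if verdict.ok:
                break
            # The critic found a deviation: regenerate with its refined instruction.
            instruction = verdict.refined or instruction
            image = generator(instruction)
        outputs.append((step.text, image))
    return outputs


# Toy stand-ins for the three components (assumptions for illustration only).
def toy_planner(prompt: str) -> List[Step]:
    parts = [p.strip() for p in prompt.split(",")]
    return [Step(text=f"Step {i + 1}: {p}", instruction=p) for i, p in enumerate(parts)]


def toy_generator(instruction: str) -> str:
    return f"<image of: {instruction}>"


def toy_critic(instruction: str, image: str) -> Verdict:
    # Pretend the critic only accepts instructions that ask for a close-up.
    if "close-up" in instruction:
        return Verdict(ok=True)
    return Verdict(ok=False, refined=instruction + ", close-up")


result = run_interleaved("crack an egg, whisk it", toy_planner, toy_generator, toy_critic)
for text, image in result:
    print(text, "->", image)
```

Each step here costs at least one generator call plus one more per rejected attempt, which is consistent with the abstract's point that a full trajectory can involve many generator calls and is expensive to optimize end to end.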

Cached at: 06/12/26, 02:52 AM

Source: https://huggingface.co/papers/2606.13679

Get this paper in your agent:

hf papers read 2606.13679

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 3

InterleaveThinker/InterleaveThinker-Planner-8B • Image-Text-to-Text • 770k • Updated 37 minutes ago

InterleaveThinker/InterleaveThinker-Critic-8B • Image-Text-to-Text • 9B • Updated 37 minutes ago

InterleaveThinker/Critic-SFT-8B • Image-Text-to-Text • 770k • Updated 37 minutes ago

Similar Articles

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

ETCHR: Editing To Clarify and Harness Reasoning
