InterleaveThinker: Reinforcing Agentic Interleaved Generation

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-im...

Signal 40

Source Confidence 78%

Claim Status: developing

Source Evidence

Primary Source

Hugging Face (Dian Zheng)

huggingface.co

Source Type

research

Published Time

6/10/2026, 8:00:00 PM

Why It Matters

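The paper reports interleaved-generation performance comparable to Nano Banana and GPT-5 from a planner-critic pipeline layered on existing image generators. If the results hold, this could shift product roadmaps, developer choices, and competitive pressure across the AI stack.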

Confirmed Facts

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation (producing text-image sequences), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard.

In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. We then introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration.

To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. We then develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory.

The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
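
The planner, generator, and critic loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names Step, Verdict, interleaved_generate, and the toy_* stubs are hypothetical, the "image" is a plain string placeholder, and the retry cap (max_retries) is an assumed knob that the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One planned step (hypothetical structure, not from the paper)."""
    instruction: str  # what the image generator is asked to produce
    text: str         # text that accompanies the image in the output


@dataclass
class Verdict:
    """Critic output: accept the image, or propose a refined instruction."""
    ok: bool
    refined_instruction: Optional[str] = None


Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]        # instruction -> image placeholder
Critic = Callable[[Step, str], Verdict]


def interleaved_generate(
    prompt: str,
    planner: Planner,
    generator: Generator,
    critic: Critic,
    max_retries: int = 2,  # assumed cap on regenerations per step
) -> Tuple[List[Tuple[str, str]], int]:
    """Return the (text, image) sequence and the number of generator calls."""
    outputs: List[Tuple[str, str]] = []
    calls = 0
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        calls += 1
        for _ in range(max_retries):
            verdict = critic(Step(instruction, step.text), image)
            if verdict.ok or verdict.refined_instruction is None:
                break
            # The critic rejected the sample: regenerate with its refinement.
            instruction = verdict.refined_instruction
            image = generator(instruction)
            calls += 1
        outputs.append((step.text, image))
    return outputs, calls


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_planner(prompt: str) -> List[Step]:
    parts = [p.strip() for p in prompt.split(";") if p.strip()]
    return [Step(instruction=p, text=f"Step {i + 1}") for i, p in enumerate(parts)]


def toy_generator(instruction: str) -> str:
    return f"img<{instruction}>"


def toy_critic(step: Step, image: str) -> Verdict:
    # Pretend an instruction is acceptable only if it asks for detail.
    if "detailed" in step.instruction:
        return Verdict(ok=True)
    return Verdict(ok=False, refined_instruction=f"detailed {step.instruction}")
```

With the toy stubs, each step is rejected once and regenerated once, so a two-step prompt costs four generator calls. In this sketch every step can trigger up to 1 + max_retries calls, which suggests how a long trajectory could reach the "over 25 generator calls" the abstract mentions.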

Who Is Affected

GPT
AI product teams

What To Watch Next

Watch for benchmark validation, API availability, pricing, limits, and early customer adoption.
Look for corroboration from an official source or a second reliable report confirming the same claim.

Still Developing

The claim is plausible but still developing.
