InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation, that is, produce a sequence of alternating text and images, which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance on this task. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence and to instruct the image generator on what to execute at each step. We then introduce a critic agent that evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets for a format cold start. We then develop Interleave-Critic-RL-13k to reinforce, using GRPO, the step-wise instruction correction capability within a generation trajectory. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward, which allow single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
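The planner, generator, and critic loop described in the abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `interleaved_generate`, the `Step` record, the three callable interfaces, and the `max_retries` cap are all hypothetical, and images are represented by plain strings so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # instruction that produced the kept image
    image: str        # stand-in for a generated image (e.g. a path or ID)
    attempts: int     # generator calls spent on this step


# Hypothetical agent interfaces; the abstract does not specify them.
Planner = Callable[[str], List[str]]             # prompt -> per-step instructions
Generator = Callable[[str, List[str]], str]      # (instruction, prior images) -> image
Critic = Callable[[str, str], Tuple[bool, str]]  # (planned instruction, image) -> (ok, refined instruction)


def interleaved_generate(
    prompt: str,
    planner: Planner,
    generator: Generator,
    critic: Critic,
    max_retries: int = 2,  # assumed cap; not taken from the paper
) -> List[Step]:
    """Run plan -> generate -> critique -> regenerate for every planned step."""
    steps: List[Step] = []
    history: List[str] = []  # images kept so far, passed as context
    for planned in planner(prompt):
        instruction = planned
        attempts = 0
        while True:
            image = generator(instruction, history)
            attempts += 1
            # The critic judges the image against the planned instruction
            # and proposes a refined instruction when it deviates.
            ok, refined = critic(planned, image)
            if ok or attempts > max_retries:
                break
            instruction = refined
        history.append(image)
        steps.append(Step(instruction, image, attempts))
    return steps
```

With a plan of N steps and `max_retries = r`, this loop issues at most N * (r + 1) generator calls, which gives a sense of how a single trajectory can involve many generator calls and why the abstract argues for single-step rather than whole-trajectory optimization.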