InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation (producing text-image sequences), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence and instruct the image generator on what to execute at each step. We then introduce a critic agent that evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold start. We then develop Interleave-Critic-RL-13k to reinforce, using GRPO, the step-wise instruction correction capability within a generation trajectory. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward, which allow single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
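The planner, generator, and critic loop described in the abstract can be pictured with a minimal sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Step`, `run_interleaved`, `max_retries`, and the three toy agents are hypothetical, and images are stood in for by plain strings. The sketch also counts generator calls, since the abstract notes that a single trajectory may involve more than 25 of them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One planned step (hypothetical structure, not from the paper)."""
    instruction: str  # what the image generator is asked to produce
    text: str         # text that accompanies the image in the output


# Hypothetical agent signatures; images are represented as strings here.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]
Critic = Callable[[str, str], Tuple[bool, str]]  # (planned, image) -> (ok, refined)


def run_interleaved(prompt: str, planner: Planner, generator: Generator,
                    critic: Critic, max_retries: int = 2):
    """Plan steps, generate each image, and regenerate when the critic objects."""
    outputs: List[Tuple[str, str]] = []
    calls = 0  # number of generator calls in this trajectory
    for step in planner(prompt):
        image = generator(step.instruction)
        calls += 1
        for _ in range(max_retries):
            ok, refined = critic(step.instruction, image)
            if ok:
                break
            # The critic judged the image to deviate from the plan:
            # regenerate with the refined instruction.
            image = generator(refined)
            calls += 1
        # Note: the image from the last allowed retry is kept without re-checking.
        outputs.append((step.text, image))
    return outputs, calls


# Toy stand-ins so the sketch runs end to end (all assumptions).
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"draw step {i} of {prompt}", f"Step {i}") for i in range(1, 4)]


def toy_generator(instruction: str) -> str:
    return f"image[{instruction}]"


def toy_critic(planned: str, image: str) -> Tuple[bool, str]:
    # Accept only images produced from an instruction that mentions "detailed".
    return "detailed" in image, planned + ", detailed"
```

With these toy agents, every step is rejected once and then accepted, so each planned step costs two generator calls. Real trajectories would replace the stand-ins with model calls, and the retry budget would trade output quality against the number of generator calls.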