InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following ability in single-image generation and editing. However, their architectures prevent them from performing interleaved generation, that is, producing a text-image sequence, which has important applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, a planner agent organizes the image-text input sequence and instructs the image generator on what to execute at each step. A critic agent then evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold start. We then develop Interleave-Critic-RL-13k to reinforce step-wise instruction correction within a generation trajectory using GRPO. Because a single interleaved generation trajectory may involve more than 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward that allow single-step RL to effectively guide the entire generation trajectory. Experiments show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly improves the base model on reasoning-based benchmarks; for example, with the 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
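The plan, generate, critique, and regenerate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `Verdict`, `interleaved_generate`, and `max_retries`, along with the toy planner, generator, and critic, are hypothetical stand-ins. The real agents are learned models and the real generator returns images rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    text: str         # text segment of the interleaved output (hypothetical field)
    instruction: str  # planned instruction for the image generator


@dataclass
class Verdict:
    ok: bool                                   # True if the image follows the plan
    revised_instruction: Optional[str] = None  # refined instruction for regeneration


# Assumed interfaces; the paper does not specify these signatures.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str, List[str]], str]  # (instruction, earlier images) -> image
Critic = Callable[[str, str], Verdict]       # (planned instruction, image) -> verdict


def interleaved_generate(
    prompt: str,
    planner: Planner,
    generator: Generator,
    critic: Critic,
    max_retries: int = 2,
) -> Tuple[List[Tuple[str, str]], int]:
    """Return the (text, image) sequence and the number of generator calls."""
    outputs: List[Tuple[str, str]] = []
    images: List[str] = []
    calls = 0
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction, images)
        calls += 1
        for _ in range(max_retries):
            # The critic judges the image against the *planned* instruction.
            verdict = critic(step.instruction, image)
            if verdict.ok:
                break
            instruction = verdict.revised_instruction or instruction
            image = generator(instruction, images)
            calls += 1
        # After the last retry the image is accepted without another check.
        images.append(image)
        outputs.append((step.text, image))
    return outputs, calls


# Toy stand-ins so the sketch runs end to end.
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"Step {i}", f"draw {prompt} stage {i}") for i in (1, 2)]


def toy_generator(instruction: str, history: List[str]) -> str:
    return f"<image:{instruction}>"


def toy_critic(planned: str, image: str) -> Verdict:
    if "detailed" in image:
        return Verdict(ok=True)
    return Verdict(ok=False, revised_instruction=planned + ", detailed")


sequence, n_calls = interleaved_generate("cat", toy_planner, toy_generator, toy_critic)
```

In this toy run each of the two planned steps is rejected once and regenerated, so the generator is called four times. With more steps and retries the count grows quickly, which is consistent with the abstract's remark that a single trajectory may involve more than 25 generator calls.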