UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Abstract

Unified models capable of interleaved text and image generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, in which the model first expands the user prompt through reasoning and then synthesizes the image. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize the text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty applied directly to the velocity fields, which provides a more robust and direct regularization signal for mitigating reward hacking. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
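The abstract names two ingredients without giving formulas: group-relative advantages computed from a single terminal reward per rollout, and an MSE penalty between velocity fields. The sketch below illustrates both in plain Python. It is not the authors' implementation. The function names, the mean/standard-deviation normalization (the form commonly used with GRPO), the `eps` constant, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts.

    Assumption: the usual GRPO-style normalization (subtract the group mean,
    divide by the group standard deviation). The abstract does not state it.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_reference):
    """Mean squared error between two velocity predictions.

    Both arguments are flat lists of floats standing in for the velocity
    field of the trained policy and of a frozen reference model
    (hypothetical stand-ins for real tensors).
    """
    if len(v_policy) != len(v_reference):
        raise ValueError("velocity vectors must have the same length")
    squared = [(p - q) ** 2 for p, q in zip(v_policy, v_reference)]
    return sum(squared) / len(squared)


# One prompt, four sampled rollouts (reasoning text followed by an image),
# each scored once at the end: a sparse terminal reward.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Under this reading, every text token and every denoising step of
# rollout i would share the same scalar advantages[i].
penalty = velocity_mse_penalty([0.1, -0.3, 0.7], [0.0, -0.2, 0.9])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed. The penalty term grows as the policy's velocity prediction drifts from the reference, which is the role the abstract assigns to it in limiting reward hacking.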
