UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
March 24, 2026
Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
cs.AI
Abstract
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.