UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
March 24, 2026
Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
cs.AI
Abstract
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text generation and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, in which the model first expands the user prompt through reasoning and then synthesizes the image. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO, which uses GRPO to jointly optimize the text and image generation policies. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning with FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty applied directly to the velocity fields, providing a more robust and direct regularization signal that effectively mitigates reward hacking. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for future post-training of fully interleaved models.
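To make the described modifications concrete, the following PyTorch-style sketch shows how group-normalized advantages from sparse terminal rewards (GRPO-style) and a velocity-field MSE penalty in place of the latent KL term could fit into a clipped policy-gradient objective. All names here (`grpo_advantages`, `velocity_mse_penalty`, `unigrpo_image_loss`) and the specific objective shape are our own assumptions based on the abstract, not the authors' implementation.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages (GRPO-style): each rollout's sparse terminal
    reward is normalized by the mean/std of its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def velocity_mse_penalty(v_policy: torch.Tensor, v_ref: torch.Tensor) -> torch.Tensor:
    """Regularizer replacing the latent KL penalty: MSE between the current
    policy's predicted velocity field and that of a frozen reference model."""
    return torch.mean((v_policy - v_ref) ** 2)


def unigrpo_image_loss(
    logratio: torch.Tensor,      # log(pi_theta / pi_old) per denoising step
    advantages: torch.Tensor,    # group-normalized, broadcastable to logratio
    v_policy: torch.Tensor,      # velocity predicted by the current policy
    v_ref: torch.Tensor,         # velocity predicted by the reference model
    clip_eps: float = 0.2,
    beta: float = 0.01,
) -> torch.Tensor:
    """Clipped policy-gradient objective plus the velocity-field MSE penalty.

    Note: v_policy comes from a single forward pass of the policy alone
    (no classifier-free guidance branch), so each rollout remains one
    linear, unbranched chain of denoising states.
    """
    ratio = torch.exp(logratio)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -torch.mean(torch.min(unclipped, clipped))
    return pg_loss + beta * velocity_mse_penalty(v_policy, v_ref)
```

Under this reading, the text policy would be updated with the standard GRPO objective on the reasoning tokens, while the image policy uses the loss above on the flow-matching denoising steps, with both sharing the same group-normalized terminal reward.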