GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
October 25, 2025
Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
cs.AI
Abstract
Recently, GRPO-based reinforcement learning has shown remarkable progress in
optimizing flow-matching models, effectively improving their alignment with
task-specific rewards. Within these frameworks, the policy update relies on
importance-ratio clipping to constrain overconfident positive and negative
gradients. However, in practice, we observe a systematic shift in the
importance-ratio distribution: its mean falls below 1 and its variance differs
substantially across timesteps. This left-shifted and inconsistent distribution
prevents positive-advantage samples from entering the clipped region, causing
the mechanism to fail in constraining overconfident positive updates. As a
result, the policy model inevitably enters an implicit over-optimization
stage: while the proxy reward continues to increase, essential metrics such as
image quality and text-prompt alignment deteriorate sharply, ultimately making
the learned policy impractical for real-world use. To address this issue, we
introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO
frameworks. Our method incorporates ratio normalization, which restores a
balanced and step-consistent importance ratio, ensuring that PPO clipping
properly constrains harmful updates across denoising timesteps. In addition, a
gradient reweighting strategy equalizes policy gradients over noise conditions,
preventing excessive updates from particular timestep regions. Together, these
designs act as a regulated clipping mechanism, stabilizing optimization and
substantially mitigating implicit over-optimization without relying on heavy KL
regularization. Extensive experiments on multiple diffusion backbones (e.g.,
SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard
significantly reduces over-optimization while maintaining or even improving
generation quality.
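
The abstract does not spell out the exact formulas, so the PyTorch-style sketch below is only an illustration of how a regulated clipping step of this kind might be wired together: a standard PPO/GRPO clipped objective, plus an assumed per-timestep ratio normalization and an assumed inverse-frequency gradient reweighting over noise levels. All names and specific forms here (grpo_guard_loss, eps_clip, the bucket-wise normalization, the frequency-based weights) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: a PPO/GRPO-style clipped loss with assumed forms of
# per-timestep ratio normalization and gradient reweighting across noise levels.
import torch


def grpo_guard_loss(logp_new, logp_old, advantages, timesteps,
                    num_timesteps, eps_clip=0.2):
    """Clipped policy-gradient loss over a batch of denoising steps.

    logp_new, logp_old: per-sample log-probs under the current / behavior policy, shape (B,)
    advantages:         group-normalized advantages, shape (B,)
    timesteps:          integer denoising-step index per sample, shape (B,)
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio

    # Ratio normalization (assumed form): re-center and re-scale the ratio within
    # each timestep bucket so its distribution sits around 1 and is consistent
    # across steps, letting the clipping band catch overconfident positive updates.
    norm_ratio = torch.empty_like(ratio)
    for t in timesteps.unique():
        mask = timesteps == t
        r = ratio[mask]
        norm_ratio[mask] = 1.0 + (r - r.mean()) / (r.std(unbiased=False) + 1e-6) * (eps_clip / 2)

    # Standard PPO clipping applied to the normalized ratio.
    unclipped = norm_ratio * advantages
    clipped = torch.clamp(norm_ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
    per_sample = -torch.min(unclipped, clipped)

    # Gradient reweighting (assumed form): equalize each noise level's contribution
    # so no timestep region dominates the update; here, inverse batch frequency.
    counts = torch.bincount(timesteps, minlength=num_timesteps).clamp(min=1)
    weights = (1.0 / counts.float())[timesteps]
    weights = weights / weights.mean()

    return (weights * per_sample).mean()
```

In a training loop, a loss of this shape would take the place of the vanilla clipped GRPO objective; per the abstract, the point of such regulated clipping is to keep optimization stable without relying on heavy KL regularization.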