
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

October 25, 2025
Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
cs.AI

Abstract

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
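
To make the described mechanism concrete, below is a minimal PyTorch sketch of a PPO-style clipped objective augmented with the two ingredients the abstract names: per-timestep normalization of the importance ratio and per-timestep gradient reweighting. This is an illustrative sketch under my own assumptions, not the paper's exact implementation; all identifiers (regulated_clip_loss, num_timestep_bins, clip_eps, the binning scheme, and the choice of statistics) are hypothetical.

```python
# Illustrative sketch (assumptions, not the authors' exact formulation):
# a PPO-style clipped objective for a flow-matching policy with
# (1) per-timestep re-centering/re-scaling of the log importance ratio, and
# (2) a per-timestep weight that equalizes gradient contributions
#     across noise levels.
import torch


def regulated_clip_loss(logp_new, logp_old, advantages, timesteps,
                        num_timestep_bins=10, clip_eps=0.2, eps=1e-6):
    """logp_new / logp_old: [B] log-probs of sampled actions under the current
    and old policies; advantages: [B] group-normalized advantages;
    timesteps: [B] integer denoising-step bin indices in [0, num_timestep_bins)."""
    log_ratio = logp_new - logp_old.detach()

    norm_log_ratio = torch.zeros_like(log_ratio)
    weights = torch.ones_like(log_ratio)
    for t in range(num_timestep_bins):
        mask = timesteps == t
        if not mask.any():
            continue
        lr = log_ratio[mask]
        # Ratio normalization (illustrative): re-center and re-scale the
        # log-ratio within each timestep bin so the ratio distribution is
        # balanced around 1 and step-consistent, letting positive-advantage
        # samples actually reach the clipping band.
        mu = lr.mean().detach()
        sigma = lr.std(unbiased=False).detach().clamp_min(eps)
        norm_log_ratio[mask] = (lr - mu) / sigma
        # Gradient reweighting (illustrative): downscale bins whose raw
        # ratios are more dispersed so no noise level dominates the update.
        weights[mask] = 1.0 / sigma

    ratio = norm_log_ratio.exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_sample_loss = -torch.min(unclipped, clipped)  # standard PPO clipping
    return (weights * per_sample_loss).mean()
```

The point of the sketch is that normalizing the log-ratio per denoising step pulls its mean back toward 0 (ratio mean toward 1), so overconfident positive updates re-enter the clipping band, while the per-bin weight keeps any single timestep region from dominating the aggregated gradient; the actual GRPO-Guard normalization and reweighting may use different statistics.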