F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
February 6, 2026
Authors: Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 → 70.3 (GRPO), 69.3 → 72.5 (DAPO), and 73.2 → 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
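To make the idea concrete, here is a minimal sketch of a focal-loss-style, difficulty-aware scaling applied to GRPO-style group-relative advantages. The specific coefficient `(1 - p)**gamma`, where `p` is the prompt's empirical success rate within the group, is an illustrative assumption modeled on focal loss; the abstract does not give the paper's exact formula, and the function and parameter names here are hypothetical.

```python
import numpy as np

def focal_scaled_advantages(rewards, gamma=2.0, eps=1e-8):
    """Group-relative advantages with a focal-loss-inspired difficulty weight.

    rewards: binary verifiable rewards (0/1) for one prompt's group of rollouts.
    The scaling (1 - p)**gamma, with p the group success rate, is an
    illustrative guess at the paper's coefficient, not its exact form.
    """
    r = np.asarray(rewards, dtype=float)
    p = r.mean()                       # empirical success rate of this prompt
    adv = (r - p) / (r.std() + eps)    # standard group-mean normalization
    return (1.0 - p) ** gamma * adv    # down-weight easy (high-p) prompts

# An easy prompt (7/8 correct) yields much smaller scaled advantages
# than a hard prompt (1/8 correct), shifting gradient mass toward
# prompts where rare-correct trajectories still matter.
easy = focal_scaled_advantages([1, 1, 1, 1, 1, 1, 1, 0])
hard = focal_scaled_advantages([1, 0, 0, 0, 0, 0, 0, 0])
```

With `gamma = 0` the weight is 1 and plain group-normalized advantages are recovered; larger `gamma` suppresses updates on high-success prompts more aggressively, which is the mechanism the abstract credits for preserving probability mass on rare correct solutions.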