

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

February 6, 2026
Authors: Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
cs.AI

Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 → 70.3 (GRPO), 69.3 → 72.5 (DAPO), and 73.2 → 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
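The abstract does not give the exact form of the scaling coefficient, but a focal-loss-inspired weight suggests something like multiplying the group-relative advantage by `(1 - p)**gamma`, where `p` is the prompt's empirical success rate within the group. A minimal sketch under that assumption (the function name, `gamma` value, and weighting formula are illustrative, not the paper's definition):

```python
import numpy as np

def focal_scaled_advantages(rewards, gamma=2.0):
    """Group-relative (GRPO-style) advantages with a focal-style weight.

    rewards: binary verifiable rewards for one prompt's group of G samples.
    The weight (1 - p)**gamma, with p the group's success rate, is one
    focal-loss-inspired choice; F-GRPO's exact coefficient may differ.
    """
    rewards = np.asarray(rewards, dtype=float)
    p = rewards.mean()                             # empirical success rate
    adv = (rewards - p) / (rewards.std() + 1e-8)   # standardized group advantage
    return (1.0 - p) ** gamma * adv                # down-weight easy prompts

# An easy prompt (3/4 correct) gets smaller update magnitudes than a
# hard prompt (1/4 correct), even though the raw advantages are symmetric.
easy = focal_scaled_advantages([1, 1, 1, 0])
hard = focal_scaled_advantages([1, 0, 0, 0])
```

With `gamma=2.0`, the easy prompt's advantages are scaled by `0.25**2 = 0.0625` versus `0.75**2 = 0.5625` for the hard prompt, so gradient mass shifts toward prompts where correct trajectories are still rare.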