F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) commonly relies on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. This lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 to 70.3 (GRPO), from 69.3 to 72.5 (DAPO), and from 73.2 to 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
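
The abstract does not give the exact formulas, so the following is only an illustrative sketch of its two technical ideas, under assumptions that are ours rather than the paper's. First, `miss_probability` uses a toy three-outcome model (each sample is common-correct, rare-correct, or incorrect) to compute the chance that a group has mixed rewards, so the update is non-zero, yet contains no rare-correct sample. Second, `focal_group_advantages` applies a hypothetical Focal-loss-style weight `(1 - success_rate) ** gamma` to standard group-normalized advantages with binary rewards. The function names, the weight's form, and the parameter `gamma` are assumptions made for illustration.

```python
import numpy as np


def miss_probability(p_common: float, p_rare: float, group_size: int) -> float:
    """P(group has mixed rewards AND contains no rare-correct sample).

    Toy model (an assumption, not the paper's exact setup): each sample is
    independently common-correct with probability p_common, rare-correct
    with probability p_rare, and incorrect otherwise.
    """
    p_wrong = 1.0 - p_common - p_rare
    no_rare = (1.0 - p_rare) ** group_size   # no rare-correct sample drawn
    all_common = p_common ** group_size      # all correct: zero advantage
    all_wrong = p_wrong ** group_size        # all incorrect: zero advantage
    return no_rare - all_common - all_wrong


def focal_group_advantages(rewards, gamma: float = 1.0, eps: float = 1e-6):
    """Group-normalized advantages scaled by a focal-style difficulty weight.

    Assumes binary rewards (1 = verified correct, 0 = incorrect), so the
    group mean is the empirical success rate of the prompt. The weight
    (1 - success_rate) ** gamma is a hypothetical form chosen by analogy
    with Focal loss; gamma = 0 recovers the unscaled advantages.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    success_rate = rewards.mean()
    advantages = (rewards - success_rate) / (rewards.std() + eps)
    weight = (1.0 - success_rate) ** gamma   # small for high-success prompts
    return weight * advantages


if __name__ == "__main__":
    for g in (1, 4, 8, 32, 256):
        print(g, miss_probability(p_common=0.6, p_rare=0.05, group_size=g))
    print(focal_group_advantages([1, 1, 1, 0]))  # high-success prompt
    print(focal_group_advantages([1, 0, 0, 0]))  # low-success prompt
```

Under this toy model the miss probability is zero for a group of one (a single sample never yields mixed rewards), rises for small groups, and decays again once the group is large enough that a rare-correct sample is almost surely drawn, which matches the non-monotonic behavior described above. The scaling step multiplies each prompt's advantages by a single scalar, so it leaves group size and sampling cost unchanged.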
