Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, one important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses by weighting every sequence equally. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B, trained on DAPO-17k and Polaris and evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
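As a rough illustration of the three aggregation rules named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that each response contributes a non-empty list of per-token policy gradient terms (already scaled by the advantage), that the positive and negative subsets are defined by the sign of the per-sequence advantage, and that the "sequence-count-based weights" are each subset's share of the sequences in the group. All function names are hypothetical.

```python
from typing import Sequence


def token_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Mean over all tokens in the group: every token gets equal weight."""
    total_tokens = sum(len(seq) for seq in terms)
    return sum(sum(seq) for seq in terms) / total_tokens


def sequence_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Mean of per-sequence token means: every sequence gets equal weight."""
    return sum(sum(seq) / len(seq) for seq in terms) / len(terms)


def balanced_aggregation(
    terms: Sequence[Sequence[float]],
    advantages: Sequence[float],
) -> float:
    """Token-level mean inside each sign subset, combined by sequence counts.

    Assumption: a subset's weight is (sequences in subset) / (group size).
    Sequences with zero advantage are skipped but still count toward the
    group size, which is also an assumption of this sketch.
    """
    group_size = len(terms)
    positive = [seq for seq, adv in zip(terms, advantages) if adv > 0]
    negative = [seq for seq, adv in zip(terms, advantages) if adv < 0]

    result = 0.0
    for subset in (positive, negative):
        if subset:  # an empty subset contributes nothing
            result += (len(subset) / group_size) * token_aggregation(subset)
    return result


# Toy group: one short positive response, one short and one long negative one.
example_terms = [[1.0, 1.0], [-1.0, -1.0], [-3.0] * 6]
example_advantages = [1.0, -1.0, -1.0]

token_value = token_aggregation(example_terms)
sequence_value = sequence_aggregation(example_terms)
balanced_value = balanced_aggregation(example_terms, example_advantages)
```

In this toy group the long negative response dominates token aggregation, while sequence aggregation gives it the same weight as the short responses. Under the stated assumptions, the balanced value falls between the two.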
