ExGRPO: Learning to Reason from Experience

Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping the learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable, and we identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical and general benchmarks, with average gains of +3.5 and +7.6 points, respectively, over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
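
To make the idea of organizing and prioritizing experiences more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract names rollout correctness and entropy as indicators of experience value but does not say how they are used, so every concrete choice here is an assumption made for illustration: the names `Experience`, `ExperienceBuffer`, and `mixed_batch` are hypothetical, the priority favors questions whose correctness is near an arbitrary target of 0.5, the lowest-entropy stored rollout is replayed, and selection is a plain top-k. The sketch only mixes replayed and fresh data in a batch; it does not implement the mixed-policy objective itself.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Experience:
    question_id: str
    trajectory: str   # stored rollout text
    entropy: float    # mean token entropy of the rollout (assumed precomputed)


@dataclass
class ExperienceBuffer:
    correctness: dict = field(default_factory=dict)  # question_id -> fraction of correct rollouts
    rollouts: dict = field(default_factory=dict)     # question_id -> list of Experience

    def add(self, exp: Experience, correctness: float) -> None:
        """Store a rollout and record the latest correctness rate for its question."""
        self.correctness[exp.question_id] = correctness
        self.rollouts.setdefault(exp.question_id, []).append(exp)

    def priority(self, question_id: str, target: float = 0.5, width: float = 0.25) -> float:
        # ASSUMPTION: a Gaussian bump that favors questions with correctness near `target`.
        # The source only says correctness is an indicator, not which values are preferred.
        c = self.correctness[question_id]
        return math.exp(-((c - target) ** 2) / (2 * width ** 2))

    def select(self, k: int) -> list:
        """Return one rollout for each of the k highest-priority questions."""
        ranked = sorted(self.rollouts, key=self.priority, reverse=True)
        # ASSUMPTION: among stored rollouts for a question, replay the lowest-entropy one.
        return [min(self.rollouts[q], key=lambda e: e.entropy) for q in ranked[:k]]


def mixed_batch(fresh_questions: list, buffer: ExperienceBuffer,
                batch_size: int, replay_ratio: float = 0.5):
    """Split a batch between fresh on-policy questions and replayed experiences.

    `replay_ratio` is a hypothetical knob for the exploration/exploitation balance.
    """
    n_replay = min(int(batch_size * replay_ratio), len(buffer.rollouts))
    replay = buffer.select(n_replay)
    on_policy = fresh_questions[: batch_size - n_replay]
    return on_policy, replay
```

In this sketch, the on-policy part of the batch would be rolled out by the current policy as usual, while the replayed part reuses stored rollouts; how the two are weighted in the loss is left out because the abstract does not describe it.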

