ExGRPO: Learning to Reason from Experience
October 2, 2025
Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
cs.AI
Abstract
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm
for improving the reasoning ability of large language models. However, standard
on-policy training discards rollout experiences after a single update, leading
to computational inefficiency and instability. While prior work on RL has
highlighted the benefits of reusing past experience, the role of experience
characteristics in shaping learning dynamics of large reasoning models remains
underexplored. In this paper, we are the first to investigate what makes a
reasoning experience valuable and identify rollout correctness and entropy as
effective indicators of experience value. Based on these insights, we propose
ExGRPO (Experiential Group Relative Policy Optimization), a framework that
organizes and prioritizes valuable experiences, and employs a mixed-policy
objective to balance exploration with experience exploitation. Experiments on
five backbone models (1.5B-8B parameters) show that ExGRPO consistently
improves reasoning performance on mathematical/general benchmarks, with an
average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO
stabilizes training on both stronger and weaker models where on-policy methods
fail. These results highlight principled experience management as a key
ingredient for efficient and scalable RLVR.
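As a rough illustration of the ideas summarized above, the Python sketch below scores stored experiences by rollout correctness and an entropy proxy, then selects the highest-value ones to mix with fresh on-policy rollouts. The class names, scoring rule, and parameters (`target_correctness`, `entropy_weight`) are illustrative assumptions for this sketch, not the paper's actual ExGRPO implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal sketch (assumptions, not the paper's code) of using rollout
# correctness and entropy as experience-value signals for replay selection.

@dataclass
class Rollout:
    tokens: List[int]
    token_logprobs: List[float]   # log-probabilities of the sampled tokens
    is_correct: bool              # verdict from the verifiable reward

@dataclass
class Experience:
    question_id: str
    rollouts: List[Rollout] = field(default_factory=list)

    def correctness(self) -> float:
        """Fraction of rollouts judged correct by the verifier."""
        if not self.rollouts:
            return 0.0
        return sum(r.is_correct for r in self.rollouts) / len(self.rollouts)

    def mean_entropy(self) -> float:
        """Average negative token log-probability, used here as a cheap entropy proxy."""
        logps = [lp for r in self.rollouts for lp in r.token_logprobs]
        return -sum(logps) / max(len(logps), 1)

def experience_value(exp: Experience,
                     target_correctness: float = 0.5,
                     entropy_weight: float = 1.0) -> float:
    """Illustrative priority: prefer questions of intermediate correctness
    (neither trivially solved nor never solved) with low-entropy rollouts."""
    difficulty_term = 1.0 - abs(exp.correctness() - target_correctness) / max(target_correctness, 1e-8)
    return difficulty_term - entropy_weight * exp.mean_entropy()

def select_for_replay(pool: List[Experience], k: int) -> List[Experience]:
    """Pick the top-k stored experiences to mix with fresh on-policy rollouts."""
    return sorted(pool, key=experience_value, reverse=True)[:k]
```

In this sketch, the selected experiences would be combined with newly sampled rollouts in each update, which is one simple way to realize the exploration/exploitation balance that the mixed-policy objective in the abstract refers to.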