ExGRPO: Learning to Reason from Experience
October 2, 2025
Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
cs.AI
Abstract
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm
for improving the reasoning ability of large language models. However, standard
on-policy training discards rollout experiences after a single update, leading
to computational inefficiency and instability. While prior work on RL has
highlighted the benefits of reusing past experience, the role of experience
characteristics in shaping learning dynamics of large reasoning models remains
underexplored. In this paper, we are the first to investigate what makes a
reasoning experience valuable and identify rollout correctness and entropy as
effective indicators of experience value. Based on these insights, we propose
ExGRPO (Experiential Group Relative Policy Optimization), a framework that
organizes and prioritizes valuable experiences, and employs a mixed-policy
objective to balance exploration with experience exploitation. Experiments on
five backbone models (1.5B-8B parameters) show that ExGRPO consistently
improves reasoning performance on mathematical/general benchmarks, with an
average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO
stabilizes training on both stronger and weaker models where on-policy methods
fail. These results highlight principled experience management as a key
ingredient for efficient and scalable RLVR.
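As a rough illustration of the ideas summarized above, the Python sketch below scores stored experiences by rollout correctness and an entropy proxy, then selects the highest-value ones to mix with fresh on-policy rollouts. The class names, scoring rule, and parameters (`target_correctness`, `entropy_weight`) are illustrative assumptions for this sketch, not the paper's actual ExGRPO implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal sketch (assumptions, not the paper's code) of using rollout
# correctness and entropy as experience-value signals for replay selection.

@dataclass
class Rollout:
    tokens: List[int]
    token_logprobs: List[float]   # log-probabilities of the sampled tokens
    is_correct: bool              # verdict from the verifiable reward

@dataclass
class Experience:
    question_id: str
    rollouts: List[Rollout] = field(default_factory=list)

    def correctness(self) -> float:
        """Fraction of rollouts judged correct by the verifier."""
        if not self.rollouts:
            return 0.0
        return sum(r.is_correct for r in self.rollouts) / len(self.rollouts)

    def mean_entropy(self) -> float:
        """Average negative token log-probability, used here as a cheap entropy proxy."""
        logps = [lp for r in self.rollouts for lp in r.token_logprobs]
        return -sum(logps) / max(len(logps), 1)

def experience_value(exp: Experience,
                     target_correctness: float = 0.5,
                     entropy_weight: float = 1.0) -> float:
    """Illustrative priority: prefer questions of intermediate correctness
    (neither trivially solved nor never solved) with low-entropy rollouts."""
    difficulty_term = 1.0 - abs(exp.correctness() - target_correctness) / max(target_correctness, 1e-8)
    return difficulty_term - entropy_weight * exp.mean_entropy()

def select_for_replay(pool: List[Experience], k: int) -> List[Experience]:
    """Pick the top-k stored experiences to mix with fresh on-policy rollouts."""
    return sorted(pool, key=experience_value, reverse=True)[:k]
```

In this sketch, the selected experiences would be combined with newly sampled rollouts in each update, which is one simple way to realize the exploration/exploitation balance that the mixed-policy objective in the abstract refers to.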