ExGRPO: Learning to Reason from Experience

October 2, 2025
Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
cs.AI

Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
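The abstract identifies rollout correctness and entropy as indicators of experience value and describes a mixed-policy objective that combines fresh rollouts with prioritized replay. The sketch below illustrates that general idea only; the class names, the scoring heuristic, and the replay ratio are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the ExGRPO code): rank stored rollouts by a
# correctness/entropy heuristic and mix the top ones with fresh on-policy
# rollouts for an update. All names and formulas here are assumptions.
from dataclasses import dataclass, field


@dataclass
class Rollout:
    question_id: int
    answer_correct: bool      # verifiable reward: did the rollout reach the right answer?
    token_entropy: float      # mean per-token entropy under the generating policy
    trajectory: list = field(default_factory=list)


class ExperienceBuffer:
    """Stores past rollouts and samples the more 'valuable' ones for replay."""

    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self.rollouts: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        self.rollouts.append(rollout)
        if len(self.rollouts) > self.max_size:
            self.rollouts.pop(0)  # drop the oldest experience

    def value(self, r: Rollout) -> float:
        # Heuristic: correct, lower-entropy rollouts score higher.
        # (Correctness and entropy as value indicators come from the abstract;
        # this particular formula is an assumption for illustration.)
        correctness_bonus = 1.0 if r.answer_correct else 0.0
        return correctness_bonus + 1.0 / (1.0 + r.token_entropy)

    def sample(self, k: int) -> list[Rollout]:
        ranked = sorted(self.rollouts, key=self.value, reverse=True)
        return ranked[:k]


def build_update_batch(buffer: ExperienceBuffer,
                       fresh_rollouts: list[Rollout],
                       replay_ratio: float = 0.5) -> list[Rollout]:
    """Mix fresh on-policy rollouts with prioritized replayed experience."""
    n_replay = int(len(fresh_rollouts) * replay_ratio)
    return fresh_rollouts + buffer.sample(n_replay)
```

In this toy setup, `replay_ratio` stands in for whatever mechanism the paper uses to balance exploration against experience exploitation in its mixed-policy objective.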