Reinforcement Learning for Reasoning in Large Language Models with One Training Example
April 29, 2025
Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
cs.AI
Abstract
We show that reinforcement learning with verifiable reward using one training
example (1-shot RLVR) is effective in incentivizing the math reasoning
capabilities of large language models (LLMs). Applying RLVR to the base model
Qwen2.5-Math-1.5B, we identify a single example that elevates model performance
on MATH500 from 36.0% to 73.6%, and improves the average performance across six
common mathematical reasoning benchmarks from 17.6% to 35.7%. This result
matches the performance obtained using the 1.2k DeepScaleR subset (MATH500:
73.6%, average: 35.9%), which includes the aforementioned example. Similar
substantial improvements are observed across various models (Qwen2.5-Math-7B,
Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and
PPO), and different math examples (many of which yield approximately 30% or
greater improvement on MATH500 when employed as a single training example). In
addition, we identify some interesting phenomena during 1-shot RLVR, including
cross-domain generalization, increased frequency of self-reflection, and
sustained test performance improvement even after the training accuracy has
saturated, a phenomenon we term post-saturation generalization. Moreover, we
verify that the effectiveness of 1-shot RLVR primarily arises from the policy
gradient loss, distinguishing it from the "grokking" phenomenon. We also show
the critical role of promoting exploration (e.g., by adding entropy loss with
an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe
that applying entropy loss alone, without any outcome reward, significantly
enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings
can inspire future work on RLVR data efficiency and encourage a re-examination
of both recent progress and the underlying mechanisms in RLVR. Our code, model,
and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
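To make the loss terms named above concrete, here is a minimal sketch of a GRPO-style policy-gradient objective combined with an entropy bonus. It is an illustration under stated assumptions, not the authors' released implementation: the function name `rlvr_loss`, its arguments, and the default coefficients are hypothetical, and inputs are assumed to be per-rollout quantities for a group of G rollouts of the single training example.

```python
# Sketch only: a GRPO-style RLVR loss with an entropy bonus, illustrating the
# policy gradient and entropy terms discussed in the abstract. Names and
# hyperparameters are hypothetical, not taken from the paper's code.
import torch

def rlvr_loss(logprobs, old_logprobs, rewards, entropy,
              clip_eps: float = 0.2, ent_coef: float = 0.01):
    """All inputs are 1-D tensors over G rollouts of the (single) training example."""
    # Group-normalized advantages computed from binary verifiable rewards.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped surrogate on the policy probability ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.minimum(ratio * advantages, clipped * advantages).mean()
    # Entropy bonus with a small coefficient to encourage exploration.
    entropy_loss = -ent_coef * entropy.mean()
    return policy_loss + entropy_loss
```

In this sketch, a positive `ent_coef` plays the exploration-promoting role the abstract highlights; dropping the reward-driven term entirely and keeping only the entropy term corresponds to the "entropy loss alone, without any outcome reward" ablation mentioned above.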