
Reinforcement Learning for Reasoning in Large Language Models with One Training Example

April 29, 2025
Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
cs.AI

Abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
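To make the mechanisms highlighted in the abstract more concrete (the policy gradient loss driving 1-shot RLVR and the entropy bonus that promotes exploration), here is a minimal PyTorch sketch of a GRPO-style policy-gradient loss with a verifiable 0/1 outcome reward and an entropy term. The function name rlvr_loss, the tensor shapes, and the entropy coefficient are illustrative assumptions and are not taken from the paper's released code.

```python
# Minimal sketch (not the authors' implementation): a GRPO-style policy-gradient
# loss over a group of sampled rollouts with a verifiable 0/1 outcome reward,
# plus an entropy bonus to encourage exploration.
import torch
import torch.nn.functional as F

def rlvr_loss(logits, actions, rewards, entropy_coef=0.01):
    """Policy-gradient loss + entropy bonus for one group of rollouts.

    logits:  (G, T, V) per-token logits for G rollouts of length T over vocab V
    actions: (G, T)    sampled token ids
    rewards: (G,)      verifiable outcome reward per rollout (1.0 if correct, else 0.0)
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (G, T, V)
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (G, T)

    # GRPO-style group-normalized advantage: reward minus group mean, scaled by group std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)              # (G,)

    # REINFORCE-style policy-gradient term; the advantage is broadcast over tokens.
    pg_loss = -(adv.unsqueeze(-1) * token_logp).mean()

    # Entropy bonus: subtracting it lowers the loss when the policy is more exploratory.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return pg_loss - entropy_coef * entropy

# Toy usage with random tensors (8 rollouts, 16 tokens, vocab size 100).
G, T, V = 8, 16, 100
logits = torch.randn(G, T, V, requires_grad=True)
actions = torch.randint(0, V, (G, T))
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
loss = rlvr_loss(logits, actions, rewards)
loss.backward()
```

Setting the rewards to zero while keeping the entropy term mirrors, in spirit, the abstract's entropy-only ablation, though the exact setup there follows the paper rather than this sketch.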
