Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify several interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
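To make the loss components named in the abstract concrete, the following is a minimal, illustrative sketch of a binary verifiable reward, a group-normalized (GRPO-style) advantage, and a policy-gradient term combined with an entropy bonus. It is not the authors' implementation: the function names, the exact-string-match reward, the sequence-level log-probabilities, and the coefficient `entropy_coef=0.001` are all assumptions chosen for illustration, and the clipping and KL terms used in practical GRPO/PPO training are omitted.

```python
import math
from statistics import mean, pstdev


def verifiable_reward(answer: str, reference: str) -> float:
    """Binary outcome reward (assumed form): 1.0 on an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rlvr_loss(seq_logprobs, rewards, token_dists, entropy_coef=0.001):
    """Simplified loss to minimize: policy-gradient term minus an entropy bonus.

    seq_logprobs: log-probability of each sampled response under the policy.
    rewards:      verifiable reward of each sampled response.
    token_dists:  next-token distributions used to estimate policy entropy.
    entropy_coef: hypothetical coefficient, not a value taken from the paper.
    """
    advantages = group_advantages(rewards)
    pg_term = -mean([a * lp for a, lp in zip(advantages, seq_logprobs)])
    entropy_term = mean([entropy(d) for d in token_dists])
    return pg_term - entropy_coef * entropy_term
```

In this sketch, a group whose responses all receive the same reward yields zero advantages, so the policy-gradient term vanishes and only the entropy bonus contributes to the loss.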
