Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of large language models (LLMs), particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thereby acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, however, we critically re-examine this assumption by measuring the pass@k metric at large values of k to explore the reasoning capability boundary of models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small values of k (e.g., k=1), base models can achieve comparable or even higher pass@k scores than their RL counterparts at large values of k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. However, this also results in a narrower reasoning capability boundary than that of the base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
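
The abstract does not say how pass@k is computed. The minimal sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is the number of correct ones. The function name `pass_at_k`, the two-problem benchmark, and all sample counts are hypothetical and chosen only to illustrate how a model that wins at k=1 can lose at large k. They are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumes the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(n: int, correct_counts: list[int], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 256 on two problems.
N_SAMPLES = 256
base_correct = [20, 2]   # base model: rarely correct, but solves both problems
rl_correct = [200, 0]    # RL model: very reliable on one problem, never solves the other

for k in (1, 128):
    base = mean_pass_at_k(N_SAMPLES, base_correct, k)
    rl = mean_pass_at_k(N_SAMPLES, rl_correct, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style counts give a higher average at k=1, while the base-style counts give a higher average at k=128, because a problem with zero correct samples contributes nothing at any k. This mirrors the narrower capability boundary described in the abstract.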
