Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
April 18, 2025
Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning capabilities of LLMs,
particularly in mathematics and programming tasks. It is widely believed that
RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning
abilities that exceed the corresponding base models' capacity. In this study,
however, we critically re-examine this assumption by measuring the
pass@k metric with large values of k to explore the reasoning
capability boundary of the models across a wide range of model families and
benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally
new reasoning patterns. While RL-trained models outperform their base models at
smaller values of k (e.g., k=1), base models can achieve a comparable or
even higher pass@k score compared to their RL counterparts at large k
values. The reasoning paths generated by RL-trained models are already included
in the base models' sampling distribution, suggesting that most reasoning
abilities manifested in RL-trained models are already possessed by the base models.
Further analysis shows that RL training boosts the performance by biasing the
model's output distribution toward paths that are more likely to yield rewards,
therefore sampling correct responses more efficiently. But this also results in
a narrower reasoning capability boundary compared to base models. Similar
results are observed in visual reasoning tasks trained with RLVR. Moreover, we
find that distillation, unlike RLVR, can genuinely introduce new knowledge into
the model. These findings underscore a critical limitation of RLVR in advancing
LLM reasoning abilities, which requires us to fundamentally rethink the impact
of RL training on reasoning LLMs and the need for a better paradigm.
Project Page: https://limit-of-RLVR.github.io
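For reference, per-problem pass@k over n sampled responses is typically computed with the standard unbiased estimator, and the benchmark-level score is the average over problems. The sketch below illustrates that estimator; whether the paper uses this exact formulation, and the sample counts shown, are assumptions for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples drawn, c: samples verified correct, k: sample budget.
    pass@k = 1 - C(n - c, k) / C(n, k), i.e. the probability that at least
    one of k samples (chosen without replacement from the n) is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustration (hypothetical numbers): a base model solves a hard problem
# in only 3 of 256 samples, so pass@1 is tiny while pass@128 is already high,
# which is how large-k evaluation probes the reasoning capability boundary.
print(pass_at_k(n=256, c=3, k=1))    # ~0.012
print(pass_at_k(n=256, c=3, k=128))  # ~0.88
```

Evaluating both the base and RL-trained model with such large-k budgets is what allows the comparison of capability boundaries described in the abstract.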