Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
May 7, 2026
作者: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
cs.AI
Abstract
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
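To make the core idea concrete, below is a minimal PyTorch sketch of an entropy-gated contrastive token loss in the spirit the abstract describes: compute per-position entropy under the base model, keep only high-entropy "decision points," and push probability mass toward the token taken by a correct rollout over the one taken by an incorrect rollout. The function name, the entropy threshold `tau_entropy`, the softplus margin form, and the assumption that correct/incorrect rollouts are pre-aligned are all illustrative assumptions, not the paper's exact ReasonMaxxer recipe.

```python
# Illustrative sketch only: entropy-gated contrastive loss over token positions.
# Loss form and hyperparameters are assumptions inferred from the abstract.
import torch
import torch.nn.functional as F

def entropy_gated_contrastive_loss(logits, pos_tokens, neg_tokens, tau_entropy=2.0):
    """logits: [T, V] base-model logits along a rollout.
    pos_tokens / neg_tokens: [T] token ids from a correct vs. an incorrect rollout,
    aligned on a shared prefix (alignment/preprocessing assumed, not shown)."""
    log_probs = F.log_softmax(logits, dim=-1)            # [T, V]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)           # [T] per-position entropy (nats)
    gate = entropy > tau_entropy                         # keep only high-entropy decision points
    if not gate.any():
        return logits.new_zeros(())                      # no gated positions -> zero loss
    pos_lp = log_probs[gate, pos_tokens[gate]]           # log p of the correct-branch token
    neg_lp = log_probs[gate, neg_tokens[gate]]           # log p of the incorrect-branch token
    # Contrastive margin: favor the correct-branch token at each gated position.
    return F.softplus(neg_lp - pos_lp).mean()

# Toy usage with random tensors standing in for base-model rollouts.
T, V = 16, 100
logits = torch.randn(T, V, requires_grad=True)
pos = torch.randint(V, (T,))
neg = torch.randint(V, (T,))
loss = entropy_gated_contrastive_loss(logits, pos, neg)
loss.backward()
```

Because the gate depends only on the base model's own entropy, no RL-trained reference model is needed to pick the positions, which is the property the abstract emphasizes.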