Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

May 7, 2026
Authors: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
cs.AI

Abstract

Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
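
The abstract describes ReasonMaxxer only at a high level, but its two named ingredients (an entropy gate computed from the base model's own logits, and a contrastive loss applied only at the gated positions) are concrete enough to sketch. The Python sketch below is an illustration inferred from the abstract alone, not the authors' implementation: the function names, the quantile threshold, and the logistic-margin form of the contrastive term are all assumptions.

# Minimal sketch of entropy-gated contrastive selection, assuming PyTorch.
# All names and the exact loss form are illustrative guesses from the abstract.
import torch
import torch.nn.functional as F

def high_entropy_mask(logits: torch.Tensor, keep_frac: float = 0.02) -> torch.Tensor:
    """Flag the most uncertain positions (the abstract reports 1-3% of tokens).

    logits: [seq_len, vocab] next-token logits from the *base* model.
    Returns a boolean mask over sequence positions.
    """
    log_p = F.log_softmax(logits, dim=-1)
    # Shannon entropy per position: H = -sum_v p(v) log p(v)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)          # [seq_len]
    threshold = torch.quantile(entropy, 1.0 - keep_frac)
    return entropy >= threshold

def gated_contrastive_loss(logits: torch.Tensor,
                           pos_tokens: torch.Tensor,
                           neg_tokens: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Contrast a correct rollout's token against an incorrect rollout's
    token, but only at the entropy-gated decision points.

    pos_tokens / neg_tokens: [seq_len] long tensors of token ids chosen at
    each position by a correct and an incorrect base-model rollout
    (hypothetical inputs; the paper's pairing scheme may differ).
    """
    log_p = F.log_softmax(logits, dim=-1)
    pos_lp = log_p.gather(-1, pos_tokens.unsqueeze(-1)).squeeze(-1)
    neg_lp = log_p.gather(-1, neg_tokens.unsqueeze(-1)).squeeze(-1)
    # Logistic margin: push the correct branch above the incorrect one.
    per_pos = -F.logsigmoid(pos_lp - neg_lp)
    # Average only over the gated (high-entropy) positions.
    return (per_pos * mask).sum() / mask.sum().clamp(min=1)

Note that the gate depends only on the base model's logits, consistent with the abstract's claim that the high-entropy decision points can be identified without any RL-trained model.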