On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
March 23, 2026
Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
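As a concrete reference, the sketch below (Python with Hugging Face Transformers; not the authors' implementation) illustrates the three quantities the abstract describes: the signed token-level Δlog p between a base checkpoint and an RLVR checkpoint, a test-time extrapolation of the next-token distribution along that direction, and an illustrative low-probability token weight for training-time reweighting. The checkpoint names, the coefficient `alpha`, the exponent `gamma`, and all helper names are assumptions for illustration; the paper's exact formulas may differ.

```python
# Minimal sketch, assuming Hugging Face `transformers` causal LMs.
# Checkpoint ids below are placeholders, not real model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "base-model"   # hypothetical base checkpoint
RLVR_NAME = "rlvr-model"   # hypothetical RLVR-trained checkpoint

tok = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
rlvr = AutoModelForCausalLM.from_pretrained(RLVR_NAME).eval()


@torch.no_grad()
def token_delta_logp(text: str) -> torch.Tensor:
    """Signed, token-level difference
    Δlog p_t = log p_rlvr(x_t | x_<t) - log p_base(x_t | x_<t)."""
    ids = tok(text, return_tensors="pt").input_ids

    def realized_logps(model):
        logits = model(ids).logits[:, :-1]              # positions predicting tokens 1..T-1
        logps = torch.log_softmax(logits, dim=-1)
        return logps.gather(-1, ids[:, 1:, None]).squeeze(-1)

    return realized_logps(rlvr) - realized_logps(base)  # shape: (1, T-1)


@torch.no_grad()
def extrapolated_next_token_logprobs(ids: torch.Tensor, alpha: float = 1.5) -> torch.Tensor:
    """Test-time extrapolation along the learned direction:
    log p_alpha = log p_base + alpha * (log p_rlvr - log p_base).
    alpha = 1 recovers the RLVR model; alpha > 1 amplifies the update direction.
    Renormalize (softmax) before sampling."""
    lp_base = torch.log_softmax(base(ids).logits[:, -1], dim=-1)
    lp_rlvr = torch.log_softmax(rlvr(ids).logits[:, -1], dim=-1)
    return lp_base + alpha * (lp_rlvr - lp_base)


def low_prob_weights(token_logps: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Illustrative training-time weight that emphasizes low-probability tokens;
    the exact reweighting scheme used in the paper is not specified here."""
    return (1.0 - token_logps.exp()) ** gamma
```

With real checkpoint ids, `token_delta_logp` yields a per-token direction signal for a given reasoning trace, and `extrapolated_next_token_logprobs` can replace the standard next-token distribution during decoding to test the extrapolation idea without further training.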