On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
March 23, 2026
Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
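As a concrete reference, the sketch below (Python with Hugging Face Transformers; not the authors' implementation) illustrates the three quantities the abstract describes: the signed token-level Δlog p between a base checkpoint and an RLVR checkpoint, a test-time extrapolation of the next-token distribution along that direction, and an illustrative low-probability token weight for training-time reweighting. The checkpoint names, the coefficient `alpha`, the exponent `gamma`, and all helper names are assumptions for illustration; the paper's exact formulas may differ.

```python
# Minimal sketch, assuming Hugging Face `transformers` causal LMs.
# Checkpoint ids below are placeholders, not real model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "base-model"   # hypothetical base checkpoint
RLVR_NAME = "rlvr-model"   # hypothetical RLVR-trained checkpoint

tok = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
rlvr = AutoModelForCausalLM.from_pretrained(RLVR_NAME).eval()


@torch.no_grad()
def token_delta_logp(text: str) -> torch.Tensor:
    """Signed, token-level difference
    Δlog p_t = log p_rlvr(x_t | x_<t) - log p_base(x_t | x_<t)."""
    ids = tok(text, return_tensors="pt").input_ids

    def realized_logps(model):
        logits = model(ids).logits[:, :-1]              # positions predicting tokens 1..T-1
        logps = torch.log_softmax(logits, dim=-1)
        return logps.gather(-1, ids[:, 1:, None]).squeeze(-1)

    return realized_logps(rlvr) - realized_logps(base)  # shape: (1, T-1)


@torch.no_grad()
def extrapolated_next_token_logprobs(ids: torch.Tensor, alpha: float = 1.5) -> torch.Tensor:
    """Test-time extrapolation along the learned direction:
    log p_alpha = log p_base + alpha * (log p_rlvr - log p_base).
    alpha = 1 recovers the RLVR model; alpha > 1 amplifies the update direction.
    Renormalize (softmax) before sampling."""
    lp_base = torch.log_softmax(base(ids).logits[:, -1], dim=-1)
    lp_rlvr = torch.log_softmax(rlvr(ids).logits[:, -1], dim=-1)
    return lp_base + alpha * (lp_rlvr - lp_base)


def low_prob_weights(token_logps: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Illustrative training-time weight that emphasizes low-probability tokens;
    the exact reweighting scheme used in the paper is not specified here."""
    return (1.0 - token_logps.exp()) ** gamma
```

With real checkpoint ids, `token_delta_logp` yields a per-token direction signal for a given reasoning trace, and `extrapolated_next_token_logprobs` can replace the standard next-token distribution during decoding to test the extrapolation idea without further training.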