On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. Existing analyses identify that RLVR-induced changes are sparse, but they focus primarily on the magnitude of these updates and largely overlook their direction. In this work, we argue that the direction of updates, captured by the signed, token-level log-probability difference Δlog p between the base and final RLVR models, is a more critical lens for understanding RLVR's effects. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p identifies sparse yet reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability tokens (which correspond to higher Δlog p) and improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
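To make the quantities in the abstract concrete, the following is a minimal sketch of how a signed token-level Δlog p and a test-time extrapolation step might be computed from the next-token logits of the two models. The abstract does not state the exact extrapolation rule, so the form used here (adding `alpha` times Δlog p to the RLVR log-probabilities and renormalising) is an assumption, as are the function names, the `alpha` parameter, and the toy logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def delta_log_p(base_logits, rlvr_logits, token_id):
    """Signed change for one token: log p_rlvr(token) - log p_base(token)."""
    return log_softmax(rlvr_logits)[token_id] - log_softmax(base_logits)[token_id]

def extrapolate(base_logits, rlvr_logits, alpha=0.5):
    """Assumed rule (not taken from the paper):
    log p_ext is proportional to log p_rlvr + alpha * (log p_rlvr - log p_base).
    alpha = 0 recovers the RLVR distribution unchanged."""
    lp_base = log_softmax(base_logits)
    lp_rlvr = log_softmax(rlvr_logits)
    scores = [r + alpha * (r - b) for b, r in zip(lp_base, lp_rlvr)]
    return log_softmax(scores)  # renormalise to a valid distribution

# Toy 3-token vocabulary with made-up logits (illustrative only).
base = [2.0, 1.0, 0.0]
rlvr = [1.0, 2.0, 0.0]

print(delta_log_p(base, rlvr, token_id=1))  # positive: RLVR raised this token
print(delta_log_p(base, rlvr, token_id=0))  # negative: RLVR lowered this token
print([round(math.exp(x), 3) for x in extrapolate(base, rlvr)])
```

The sign is what separates this from a magnitude-based metric: tokens 0 and 1 in the toy example change by the same absolute amount, but only token 1 moves in the direction that RLVR reinforced, and the extrapolation step pushes its probability further in that direction.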
