Minimal RLVR Training Is Enough: Extrapolating LLMs via Rank-1 Trajectories

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose RELEX (REinforcement Learning EXtrapolation), a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as little as 15% of the steps of full RLVR training. Remarkably, RELEX can extrapolate far beyond the observation window at no training cost, predicting checkpoints 10-20 times beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: projecting updates onto the rank-1 subspace discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
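
The procedure described in the abstract (estimate a rank-1 subspace from a short window of checkpoints, fit a line to the projection magnitude, extrapolate) can be illustrated with a minimal NumPy sketch for a single weight matrix. This is not the authors' implementation. The function name relex_extrapolate, its arguments, and the choice to obtain the rank-1 direction from an SVD of the stacked, flattened deltas are all assumptions made for illustration; the released code may define the subspace differently (for example, per layer).

```python
import numpy as np


def relex_extrapolate(w0, checkpoints, steps, target_step):
    """Hypothetical sketch: rank-1 linear extrapolation of one weight matrix.

    w0          -- base weights before RLVR training
    checkpoints -- weights observed at the training steps in `steps`
    target_step -- (possibly much later) step to predict
    """
    steps = np.asarray(steps, dtype=float)

    # Each row is one flattened parameter delta relative to the base weights.
    deltas = np.stack([(w - w0).ravel() for w in checkpoints])

    # Assumption: the rank-1 subspace is the top right singular vector of the
    # stacked deltas, i.e. the dominant direction of the observed trajectory.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]

    # Scalar projection of each observed delta onto that direction.
    coeffs = deltas @ direction

    # Ordinary least-squares line: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)

    # Move along the rank-1 direction by the predicted magnitude. Components
    # of the deltas outside this direction are discarded.
    predicted = slope * target_step + intercept
    return w0 + (predicted * direction).reshape(w0.shape)


if __name__ == "__main__":
    # Synthetic trajectory: linear drift along one direction plus small noise.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(8, 8))
    drift = rng.normal(size=(8, 8))
    observed_steps = [10, 20, 30, 40, 50]
    observed = [
        base + 0.01 * s * drift + 1e-3 * rng.normal(size=(8, 8))
        for s in observed_steps
    ]
    w_pred = relex_extrapolate(base, observed, observed_steps, target_step=1000)
    print("relative error:",
          np.linalg.norm(w_pred - (base + 10.0 * drift)) / np.linalg.norm(10.0 * drift))
```

The sign ambiguity of the singular vector is harmless here, because the projection coefficients and the direction flip sign together. In this toy setting the noise components orthogonal to the dominant direction are dropped by the projection, which mirrors the "denoising" effect the abstract attributes to RELEX.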