Only Minimal RLVR Training Is Needed: Extrapolating Large Language Models via Rank-1 Trajectories

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose RELEX (REinforcement Learning EXtrapolation), a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as little as 15% of the steps of full RLVR training. Remarkably, RELEX can extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10 to 20 times beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the method discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
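The extrapolation idea described above can be illustrated with a minimal sketch for a single weight tensor. This is not the released RELEX implementation. The abstract does not specify how the rank-1 subspace is computed, so the sketch assumes one plausible reading: flatten each checkpoint's delta from the base weights, take the top right singular vector of the stacked deltas as the rank-1 direction, and fit a straight line to the projection coefficients over training steps. The function name `rank1_extrapolate` and its parameters are hypothetical.

```python
import numpy as np


def rank1_extrapolate(base, checkpoints, steps, target_step):
    """Extrapolate one weight tensor along a rank-1 trajectory (illustrative sketch).

    Assumptions (not taken from the paper): deltas are measured from `base`,
    the direction is the top right singular vector of the stacked deltas,
    and the projection magnitude is modeled as linear in the training step.
    """
    base = np.asarray(base, dtype=np.float64)
    # One row per observed checkpoint: the flattened delta from the base weights.
    deltas = np.stack(
        [(np.asarray(c, dtype=np.float64) - base).ravel() for c in checkpoints]
    )
    # The top right singular vector spans the dominant rank-1 direction.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    # Scalar projection of each observed delta onto that direction.
    coeffs = deltas @ direction
    # Ordinary least-squares line: coefficient as a function of training step.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=np.float64), coeffs, deg=1)
    predicted = slope * target_step + intercept
    # Rebuild the weights from the base plus the extrapolated rank-1 delta.
    return base + (predicted * direction).reshape(base.shape)
```

Under these assumptions, a call such as `rank1_extrapolate(w0, [w10, w20, w30, w40, w50], [10, 20, 30, 40, 50], 1000)` would mirror the "observe 50 steps, extrapolate to 1000" setting. Components of the observed deltas orthogonal to the fitted direction are dropped, which is the projection step the abstract credits with the denoising effect.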