Minimal RLVR Training Is Enough: Extrapolating LLMs via Rank-1 Trajectories

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose RELEX (REinforcement Learning EXtrapolation), a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the steps of full RLVR training. Remarkably, RELEX can extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the method discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
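The procedure outlined in the abstract (estimate a rank-1 direction from a short window of checkpoints, fit a line to the projection magnitudes, then extrapolate) can be illustrated with a minimal NumPy sketch. This is not the released RELEX implementation. It assumes, for illustration only, that all parameters are flattened into a single vector, that deltas are measured relative to the base checkpoint, and that one direction is shared across the whole model; the function and variable names (`rank1_extrapolate`, `base`, `checkpoints`) are hypothetical.

```python
import numpy as np


def rank1_extrapolate(base, checkpoints, steps, target_step):
    """Extrapolate a checkpoint along the dominant rank-1 direction.

    base:        (d,)   flattened parameters before RL training (assumption)
    checkpoints: (T, d) flattened parameters at the observed steps
    steps:       (T,)   training step of each observed checkpoint
    target_step: step at which to predict the parameters
    """
    base = np.asarray(base, dtype=float)
    deltas = np.asarray(checkpoints, dtype=float) - base  # (T, d)

    # Top right singular vector = best rank-1 direction for the deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]  # unit norm, shape (d,)

    # Scalar projection of every observed delta onto that direction.
    coeffs = deltas @ direction  # (T,)

    # Ordinary least-squares line: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=float), coeffs, deg=1)

    # Predicted magnitude at the target step, mapped back to parameter space.
    # The sign ambiguity of the singular vector cancels in this product.
    predicted = slope * target_step + intercept
    return base + predicted * direction


# Synthetic demo (toy data, not results from the paper): a trajectory that
# drifts linearly along one direction, plus isotropic noise at every step.
rng = np.random.default_rng(0)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
w0 = rng.normal(size=dim)
observed_steps = np.arange(1, 51)  # observe steps 1..50
observed = (
    w0
    + 0.01 * observed_steps[:, None] * true_dir
    + 1e-3 * rng.normal(size=(observed_steps.size, dim))
)

w_pred = rank1_extrapolate(w0, observed, observed_steps, target_step=1000)
w_true = w0 + 0.01 * 1000 * true_dir
rel_err = np.linalg.norm(w_pred - w_true) / np.linalg.norm(w_true - w0)
print(f"relative extrapolation error: {rel_err:.4f}")
```

In this toy setting the components of each delta orthogonal to the fitted direction are simply dropped, which mirrors the "denoising" intuition in the abstract. A real model may call for a per-layer or per-matrix treatment, which this sketch does not attempt.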