Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Abstract

Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. However, it requires guiding the model to perform extensive exploration and learning, which incurs substantial computational overhead and has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and that its dominance over the original parameters is further amplified during LoRA (Low-Rank Adaptation) training. Based on these insights, we propose Nonlinear Extrapolation of low-rank trajectories (NExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of the parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we use the extracted rank-1 subspaces to train a predictor that models the trajectory of parameter updates during RLVR, and then perform a predict-extend process to extrapolate the model parameters, thereby accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.
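As a rough illustration of what extracting a rank-1 subspace from a parameter difference can look like, the sketch below takes a weight difference (a checkpoint minus the initial weights) and finds its top singular triplet by power iteration. The abstract does not specify the extraction procedure, so the use of the top singular direction, the dominance measure (share of the squared Frobenius norm carried by that direction), and the names rank1_component and rank1_dominance are all assumptions made for illustration. It is not the NExt implementation.

```python
import math


def rank1_component(delta, iters=200):
    """Top singular triplet (sigma, u, v) of a matrix given as a list of rows.

    Hypothetical helper: uses power iteration on delta^T delta so the sketch
    needs no external libraries. An SVD routine would normally be used instead.
    """
    rows, cols = len(delta), len(delta[0])
    # Deterministic start vector; it must not be orthogonal to the top
    # right-singular vector for the iteration to converge to it.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(delta[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(delta[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix, or start vector lies in the null space
        v = [x / norm for x in w]
    u = [sum(delta[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def rank1_dominance(delta):
    """Share of the squared Frobenius norm captured by the top singular direction."""
    sigma, _, _ = rank1_component(delta)
    total = sum(x * x for row in delta for x in row)
    return sigma * sigma / total if total > 0.0 else 0.0


# Toy example (made-up numbers): difference between a checkpoint and the base weights.
w_base = [[1.0, 0.0], [0.0, 1.0]]
w_step = [[4.0, 4.0], [6.0, 9.0]]
delta_w = [[a - b for a, b in zip(r1, r0)] for r1, r0 in zip(w_step, w_base)]
sigma, u, v = rank1_component(delta_w)
print(sigma, u, v, rank1_dominance(delta_w))
```

Repeating this at several checkpoints gives a sequence of (sigma, u, v) triplets. Under the assumptions above, such a sequence is the kind of low-rank trajectory a predictor could be fitted to.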

