
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

April 13, 2026
Authors: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
cs.AI

Abstract

Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. Because RLVR requires guiding the model through extensive exploration and learning, it incurs substantial computational overhead, which has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of parameter updates during RLVR training remain insufficiently understood. To investigate how LLMs evolve during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and that its dominance over the original parameters is further amplified during LoRA training. Based on these insights, we propose Nonlinear Extrapolation of low-rank trajectories (NExt), a framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model with LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. We then use the extracted rank-1 subspaces to train a predictor that models the trajectory of parameter updates during RLVR, and perform a predict-extend process to extrapolate model parameters, thereby accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method: it reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.
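As a rough illustration of the low-rank trajectory idea described above (not the paper's actual implementation), the sketch below extracts the rank-1 subspace of a parameter-difference matrix via SVD at several toy checkpoints, fits a simple nonlinear model to the top singular value, and extrapolates it to a future step. The logarithmic growth model, the matrix shapes, and the helper name are all assumptions made for illustration only.

```python
import numpy as np

def rank1_subspace(delta):
    # Top singular triplet of a parameter-difference matrix:
    # the rank-1 subspace that dominates the update.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, 0], s[0], vt[0, :]

# Toy checkpoints: each update is rank-1 and its magnitude grows
# nonlinearly (here, log-like) with the training step.
steps = np.array([1.0, 2.0, 3.0, 4.0])
sigmas = []
for t in steps:
    delta = np.outer(np.ones(8), np.ones(8)) * np.log1p(t)
    _, s0, _ = rank1_subspace(delta)
    sigmas.append(s0)

# Fit sigma(t) ~ a * log(1 + t) and extrapolate to a future step,
# rather than extrapolating linearly as in prior work.
a = float(np.mean(np.array(sigmas) / np.log1p(steps)))
sigma_future = a * np.log1p(8.0)  # predicted magnitude at step 8
```

In the actual method, a learned predictor plays the role of the hand-picked log model here, and the extrapolated low-rank component is merged back into the model parameters to skip training steps.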