
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

April 13, 2026
作者: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
cs.AI

Abstract

Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. However, it requires guiding the model through extensive exploration and learning, which incurs substantial computational overhead and poses a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters, yet the dynamics of parameter updates during RLVR training remain insufficiently understood. To investigate how LLMs evolve during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and that its dominance over the original parameters is further amplified during LoRA training. Based on these insights, we propose Nonlinear Extrapolation of low-rank trajectories (NExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. We then use the extracted rank-1 subspaces to train a predictor that models the trajectory of parameter updates during RLVR, and perform a predict-extend process to extrapolate model parameters, thereby accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.
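The pipeline the abstract describes (extract the rank-1 subspace of parameter differences at several training steps, model the trajectory nonlinearly, then extrapolate) can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: it uses SVD to pull the top singular triplet from each parameter-difference matrix and a simple quadratic fit in place of the learned predictor; the matrix shapes, step values, and growth curve are all invented for demonstration.

```python
import numpy as np

def rank1_subspace(delta_w):
    """Top singular triplet (u, s, v) of a parameter-difference matrix."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, 0], s[0], vt[0, :]

# Toy trajectory: parameter differences recorded at several LoRA training steps.
rng = np.random.default_rng(0)
u_true = rng.normal(size=(8, 1))
v_true = rng.normal(size=(1, 6))
steps = np.array([1.0, 2.0, 3.0, 4.0])
deltas = [np.log1p(t) * (u_true @ v_true) for t in steps]  # nonlinear growth

# Extract the rank-1 scale at each step, fit a nonlinear (quadratic) model,
# and extrapolate to a future step -- standing in for the trained predictor.
scales = np.array([rank1_subspace(d)[1] for d in deltas])
coef = np.polyfit(steps, scales, deg=2)
t_future = 6.0
s_pred = np.polyval(coef, t_future)

# Predict-extend: reuse the latest rank-1 direction with the predicted scale.
u0, _, v0 = rank1_subspace(deltas[-1])
delta_pred = s_pred * np.outer(u0, v0)
```

In this sketch the predictor only extrapolates the singular-value trajectory while the rank-1 direction is frozen at the latest step; the actual NExt predictor models the full low-rank update trajectory.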