

On Predictability of Reinforcement Learning Dynamics for Large Language Models

October 1, 2025
Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang
cs.AI

Abstract

Recent advances in the reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction of the final update from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update from a short early training window, achieving up to 2.5× speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning. This positions our findings as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
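
To make the two properties concrete, below is a minimal NumPy sketch (not the authors' released code; the helper names `rank1_update` and `extrapolate_final_weight` are hypothetical) of how one could isolate the top singular direction of an update matrix ΔW = W_RL − W_base and linearly extrapolate it from a few early checkpoints, in the spirit of AlphaRL:

```python
import numpy as np

def rank1_update(w_base: np.ndarray, w_rl: np.ndarray) -> np.ndarray:
    """Rank-1 approximation of the RL-induced update dW = W_rl - W_base.

    Keeps only the top singular direction of the update matrix,
    illustrating the paper's Rank-1 Dominance property.
    """
    delta = w_rl - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0, :])

def extrapolate_final_weight(w_base, early_ckpts, steps, final_step):
    """Predict the final weight matrix from early checkpoints.

    Fits each entry of the rank-1 update as a linear function of the
    training step (Rank-1 Linear Dynamics) and evaluates the fit at
    the final step -- a simplified stand-in for AlphaRL's extrapolation.
    """
    updates = np.stack([rank1_update(w_base, w) for w in early_ckpts])
    n = updates.shape[0]
    flat = updates.reshape(n, -1)          # one column per matrix entry
    slope, intercept = np.polyfit(steps, flat, deg=1)
    pred = slope * final_step + intercept  # linear extrapolation per entry
    return w_base + pred.reshape(updates.shape[1:])
```

In the paper this kind of analysis is applied per parameter matrix across the model; the sketch above only illustrates the mechanics for a single matrix and omits any algorithmic details specific to AlphaRL.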