The Path Not Taken: RLVR Provably Learns Off the Principals
November 11, 2025
Authors: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions that are highly reproducible across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags behind RLVR.
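The parameter-level quantities named above (spectral drift, principal-subspace rotation, off-principal update alignment) can be probed with simple SVD diagnostics on a single weight matrix. The sketch below is our own illustration, not the paper's exact protocol: the function names, the rank cutoff k, and the specific metric definitions are assumptions.

```python
import torch


def spectral_drift_and_rotation(W0: torch.Tensor, W1: torch.Tensor, k: int = 16):
    """Spectral drift and principal-subspace rotation between W0 and W1 (top-k)."""
    U0, S0, _ = torch.linalg.svd(W0.float(), full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W1.float(), full_matrices=False)

    # Relative change of the leading singular values ("spectral drift").
    drift = ((S1[:k] - S0[:k]).norm() / S0[:k].norm()).item()

    # Principal angles between the top-k left singular subspaces:
    # their cosines are the singular values of U0_k^T U1_k.
    cosines = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k]).clamp(max=1.0)
    max_angle_deg = torch.rad2deg(torch.arccos(cosines.min())).item()
    return drift, max_angle_deg


def off_principal_fraction(W0: torch.Tensor, W1: torch.Tensor, k: int = 16):
    """Fraction of the update's energy outside the top-k principal block of W0."""
    U0, _, Vh0 = torch.linalg.svd(W0.float(), full_matrices=False)
    Uk, Vk = U0[:, :k], Vh0[:k, :].T
    dW = (W1 - W0).float()
    proj = Uk @ (Uk.T @ dW @ Vk) @ Vk.T   # component inside the principal block
    return ((dW - proj).norm() ** 2 / dW.norm() ** 2).item()
```

Applied to a weight matrix taken from checkpoints before and after training, the abstract's claims would predict small drift, small principal angles, and a large off-principal fraction for RLVR, with the opposite pattern for SFT.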
Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
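Gate III attributes the apparent sparsity to storage precision: micro-updates in non-preferred regions fall below the resolution of the checkpoint format. The toy sketch below is our own illustration (assuming bfloat16 weight storage; the magnitudes 1e-2 and 1e-6 are arbitrary), showing how a fully dense micro-update can register as a mostly unchanged checkpoint.

```python
import torch

torch.manual_seed(0)

w = torch.randn(1_000_000) * 1e-2    # stand-in "pretrained" weights
dw = torch.randn(1_000_000) * 1e-6   # dense micro-update, e.g. from one RL step

changed_fp32 = ((w + dw) != w).float().mean().item()
changed_bf16 = ((w.bfloat16() + dw.bfloat16()) != w.bfloat16()).float().mean().item()

print(f"weights changed in fp32: {changed_fp32:.1%}")  # ~100%: the update is dense
print(f"weights changed in bf16: {changed_bf16:.1%}")  # small: rounding hides most of it
```

In bfloat16 the update is smaller than half a unit in the last place for most weights, so it rounds away and the stored delta looks sparse even though the underlying optimization step was dense.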