
The Path Not Taken: RLVR Provably Learns Off the Principals

November 11, 2025
Authors: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
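To make the quantities named in the abstract concrete, below is a minimal sketch (in PyTorch; not the paper's own code or exact metrics) of how one might probe, for a single weight matrix, the spectral drift, the principal-subspace rotation, and the fraction of an update that falls along the pretrained principal directions. The function name, the choice of k, and the specific formulas are illustrative assumptions.

```python
# Illustrative diagnostics (assumed, not the paper's definitions) comparing a
# pretrained weight matrix W0 with its fine-tuned counterpart W1.
import torch

def principal_subspace_diagnostics(W0: torch.Tensor, W1: torch.Tensor, k: int = 32):
    """Return (spectral_drift, principal_rotation, in_principal_fraction)."""
    # SVD of the pretrained and fine-tuned matrices.
    U0, S0, _ = torch.linalg.svd(W0, full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W1, full_matrices=False)

    # Spectral drift: relative change in the singular-value spectrum.
    spectral_drift = (torch.norm(S1 - S0) / torch.norm(S0)).item()

    # Principal-subspace rotation: 1 minus the mean cosine of the principal
    # angles between the top-k left singular subspaces (0 = unchanged).
    cosines = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k])
    principal_rotation = (1.0 - cosines.mean()).item()

    # Fraction of the update's energy inside the pretrained top-k principal
    # subspace; small values indicate "off-principal" updates.
    dW = W1 - W0
    dW_proj = U0[:, :k] @ (U0[:, :k].T @ dW)
    in_principal_fraction = (dW_proj.norm() ** 2 / dW.norm() ** 2).item()

    return spectral_drift, principal_rotation, in_principal_fraction

# Hypothetical usage on one layer from a base model and an RLVR-tuned model:
# drift, rotation, frac = principal_subspace_diagnostics(
#     base_layer.weight.data, rlvr_layer.weight.data, k=32)
```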