ChatPaper.ai

Evaluating Parameter Efficient Methods for RLVR

December 29, 2025
Authors: Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu
cs.AI

Abstract

We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill model families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further validate these findings through ablation studies and scaling experiments. This work provides a definitive guide for parameter-efficient RLVR and advocates further exploration of parameter-efficient RL methods.
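To make the compared method families concrete, below is a minimal NumPy sketch of a LoRA-style low-rank update and a PiSSA-style SVD-informed initialization. The shapes, rank, and scaling here are illustrative assumptions for exposition, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # illustrative sizes, not the paper's setup

# LoRA: freeze W, train only the low-rank factors A and B.
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x); the delta starts at zero because B = 0.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # no change before any training step

# SVD-informed initialization (PiSSA-style sketch): seed the adapter with the
# top-r singular directions of W and keep the remainder as a frozen residual.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
B_svd = U[:, :r] * np.sqrt(S[:r])
A_svd = np.sqrt(S[:r])[:, None] * Vt[:r, :]
W_res = W - B_svd @ A_svd                   # frozen residual weight
assert np.allclose(W_res + B_svd @ A_svd, W)
```

Because the adapter trains only `r * (d_in + d_out)` parameters instead of `d_in * d_out`, these methods are attractive for RL fine-tuning; the paper's spectral-collapse finding concerns the second variant, where updates are concentrated on the principal components of `W`.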