

Evaluating Parameter Efficient Methods for RLVR

December 29, 2025
Authors: Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu
cs.AI

Abstract

We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to improve their reasoning capabilities through verifiable feedback; however, while methods like LoRA are widely used, the optimal PEFT architecture for RLVR remains an open question. In this work, we conduct the first comprehensive evaluation of 12 PEFT methods across the DeepSeek-R1-Distill model family on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA, with three main findings. First, we demonstrate that structural variants such as DoRA, AdaLoRA, and MiSS consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further validate these findings with additional ablation studies and scaling experiments. This work provides a definitive guide to PEFT for RLVR and advocates further exploration of parameter-efficient RL methods.
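
To make the contrast between the compared setups concrete, below is a minimal PyTorch sketch (not the paper's code) of a low-rank adapter that supports both standard LoRA initialization, where the update starts at zero, and a PiSSA-style SVD-informed initialization that moves the top singular directions of the pretrained weight into the trainable adapter. The class name LoRALinear, the rank/alpha defaults, and the svd_init flag are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) contrasting
# standard LoRA initialization with a PiSSA-style SVD-informed initialization.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + s * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 svd_init: bool = False):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter (A, B) is trained

        out_f, in_f = base.weight.shape
        self.scaling = alpha / rank
        self.A = nn.Parameter(torch.zeros(rank, in_f))
        self.B = nn.Parameter(torch.zeros(out_f, rank))

        if svd_init:
            # PiSSA-style: seed the adapter with the top-r singular directions
            # of the pretrained weight and subtract them from the frozen base,
            # so RL gradients act directly on the principal components.
            U, S, Vh = torch.linalg.svd(base.weight.data, full_matrices=False)
            sqrt_s = S[:rank].sqrt()
            self.B.data = U[:, :rank] * sqrt_s             # (out_f, r)
            self.A.data = sqrt_s.unsqueeze(1) * Vh[:rank]  # (r, in_f)
            # Keep the layer's function unchanged at initialization.
            self.base.weight.data -= self.scaling * (self.B.data @ self.A.data)
        else:
            # Standard LoRA: A is random, B is zero, so the update starts at 0.
            nn.init.normal_(self.A, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


# Usage: wrap a projection layer; an RLVR training loop would then update
# only A and B while the base weights stay frozen.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8, svd_init=True)
```

Note the design difference this exposes: under SVD-informed initialization the trainable subspace is exactly the principal components of the pretrained weight, which is the alignment the paper identifies as mismatched with RL optimization and links to spectral collapse.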