

LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

June 1, 2025
Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
cs.AI

Abstract

Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet it has lagged behind in the LLM era because it is difficult to identify the parameters that are truly critical for reasoning. In this work, we propose that the weights with the largest magnitude after low-rank approximation are the critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline for LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT updates only the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge than Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
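The abstract describes the core selection step: low-rank approximate each weight matrix, then keep the entries of that approximation with the largest magnitude as the sparse update mask. Below is a minimal sketch of that idea, not the authors' implementation (their code is at the GitHub link above); the function name, the rank, and the 5% density are illustrative assumptions in PyTorch-style code.

```python
# Sketch (not the authors' code): select "Principal Weights" by taking the top
# entries, by magnitude, of a rank-r approximation of a weight matrix.
import torch


def principal_weight_mask(W: torch.Tensor, rank: int = 64, density: float = 0.05) -> torch.Tensor:
    """Boolean mask over W marking the top `density` fraction of entries of its
    rank-`rank` approximation, ranked by absolute value."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    W_lowrank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # rank-r approximation of W
    k = max(1, int(density * W.numel()))
    # k-th largest magnitude of the low-rank approximation serves as the threshold
    threshold = W_lowrank.abs().flatten().kthvalue(W.numel() - k + 1).values
    return W_lowrank.abs() >= threshold


# During sparse fine-tuning, gradients outside the mask would be zeroed so that
# only the selected ~5% of weights are ever updated, e.g. after loss.backward():
#   W.grad.mul_(mask)   # then call optimizer.step()
```

In this sketch the mask is computed once from the pretrained weights and reused throughout training, matching the abstract's statement that LIFT updates only the top 5% Principal Weights for the whole fine-tuning run.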