LoRA Learns Less and Forgets Less
May 15, 2024
Authors: Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham
cs.AI
Abstract
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (approximately 10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
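For readers unfamiliar with the mechanism, the low-rank perturbation the abstract refers to can be sketched as a single linear layer: a frozen pretrained weight W is augmented with a trainable rank-r product B A, scaled by alpha/r. The snippet below is a minimal, illustrative PyTorch sketch under that standard parameterization; the class and parameter names are ours and do not come from the paper or its code.

# Minimal LoRA-style linear layer sketch (illustrative, not the authors' implementation).
# The frozen base weight W is perturbed by a trainable low-rank product B @ A,
# giving an effective weight W + (alpha / r) * B @ A with r much smaller than the layer dims.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 16, alpha: float = 32.0):
        super().__init__()
        # Frozen pretrained weight (stands in for one selected weight matrix).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors: only these receive gradients during finetuning.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                          # frozen path
        delta = (x @ self.lora_A.T) @ self.lora_B.T       # rank-r perturbation
        return base + self.scale * delta

# Usage: only lora_A and lora_B are updated, which is where the memory savings come from.
layer = LoRALinear(d_in=4096, d_out=4096, r=16)
y = layer(torch.randn(2, 4096))
print(y.shape)  # torch.Size([2, 4096])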