LoRA Learns Less and Forgets Less
May 15, 2024
Authors: Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham
cs.AI
Abstract
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (approximately 10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
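
To make the "low-rank perturbations to selected weight matrices" concrete, here is a minimal PyTorch-style sketch of the idea. The class name `LoRALinear` and the hyperparameters `r` and `alpha` are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable rank-r perturbation B @ A (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # memory savings: no gradients or optimizer state for W
        d_out, d_in = base.weight.shape
        # B starts at zero so the perturbation B @ A is zero at initialization,
        # and training begins exactly at the base model.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying (W + scaling * B @ A) without materializing the sum.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Because only `A` and `B` receive gradients, the optimizer state scales with r(d_in + d_out) rather than d_in * d_out, which is where the memory savings come from.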
AI-Generated Summary
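One way to probe the claimed 10-100X rank gap is to take the SVD of the weight perturbation left behind by full finetuning and count how many singular values carry most of its energy. The sketch below is one such measurement under stated assumptions; the 0.90 energy threshold is an illustrative choice, not the paper's exact protocol.

```python
import torch

def effective_rank(w_base: torch.Tensor, w_finetuned: torch.Tensor,
                   energy: float = 0.90) -> int:
    """Smallest k such that the top-k singular values of the perturbation
    w_finetuned - w_base capture `energy` of its squared Frobenius norm.
    The 0.90 threshold is an assumption for illustration."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                    # singular values, descending
    cumulative = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int((cumulative < energy).sum().item()) + 1
```

Comparing this quantity for a fully finetuned matrix against the fixed rank r of a typical LoRA configuration makes the gap the abstract describes directly measurable.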