LoRA Learns Less and Forgets Less

Abstract

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (approximately 10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization than common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
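
To make the phrase "low-rank perturbations to selected weight matrices" concrete, here is a minimal NumPy sketch. It assumes the commonly used LoRA formulation, in which a frozen weight W is adapted as W + (alpha / r) * B A; the abstract itself does not spell this out. The matrix sizes, the scaling value, and the helper names `lora_delta` and `trainable_params` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def lora_delta(A, B, alpha):
    """Low-rank perturbation (alpha / r) * B @ A, where r = A.shape[0]."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts as (full finetuning, LoRA with rank r)."""
    return d_out * d_in, r * (d_out + d_in)

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from the paper).
d_out, d_in, r, alpha = 64, 48, 4, 8.0

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in))      # trainable factor
# B is often zero-initialized in practice; it is random here so that
# the rank of the perturbation is visible.
B = rng.normal(size=(d_out, r))     # trainable factor

delta = lora_delta(A, B, alpha)
W_adapted = W + delta

# The perturbation can never exceed rank r, whatever A and B contain.
delta_rank = np.linalg.matrix_rank(delta)

full_count, lora_count = trainable_params(d_out, d_in, r)
print(f"rank of perturbation: {delta_rank}")
print(f"trainable parameters: full={full_count}, LoRA={lora_count}")
```

The rank cap is the point of the sketch: LoRA trains r * (d_out + d_in) parameters instead of d_out * d_in, but the update it can express is bounded by r, whereas the abstract reports that full finetuning learns perturbations of much higher rank.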
