LoRA Learns Less and Forgets Less

Abstract

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (approximately 10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization than common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
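As a rough illustration of what training only a low-rank perturbation to a weight matrix means, the NumPy sketch below builds an update of the form (alpha / r) * B @ A and compares its parameter count and rank with those of a full weight matrix. The function name, matrix sizes, and random initialization are assumptions chosen for illustration and do not come from the abstract. In practice B is usually initialized to zero; random values are used here only so that the rank of the update is visible.

```python
import numpy as np

def lora_delta(d_out, d_in, r, alpha=1.0, seed=0):
    """Build an illustrative low-rank update delta = (alpha / r) * B @ A."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d_out, r))  # d_out x r factor (hypothetical values)
    A = rng.standard_normal((r, d_in))   # r x d_in factor (hypothetical values)
    return (alpha / r) * (B @ A), A, B

# Hypothetical sizes, far smaller than a real transformer layer.
d_out, d_in, r = 64, 48, 4
delta, A, B = lora_delta(d_out, d_in, r)

full_params = d_out * d_in       # parameters updated by full finetuning
lora_params = A.size + B.size    # parameters trained by LoRA
delta_rank = np.linalg.matrix_rank(delta)

print(f"full: {full_params} params, LoRA: {lora_params} params, "
      f"rank of update: {delta_rank}")
```

The rank of such an update can never exceed r, which is the quantity the abstract compares against the rank of the perturbations learned by full finetuning.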
