QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Abstract

Recent years have witnessed rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, their heavy computational burden largely restricts their application, especially when they must be deployed on edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators that increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
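
The abstract does not spell out the group-wise operators, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes min-max quantization of each weight column in groups of input rows, and a low-rank branch that sum-pools the input within each group before applying the adapter matrices. Under those assumptions the adapter's contribution is constant within each group, so it can be folded into the per-group zero points while the integer codes stay unchanged. All function names, the pooling choice, and the omission of a LoRA scaling factor are assumptions made for illustration.

```python
import random

def quantize_groupwise(W, group_size, bits=4):
    """Min-max quantize each column of W (D_in x D_out) in groups of rows.

    Returns integer codes Q (D_in x D_out) and per-group scale/zero tables
    (L x D_out) such that W[i][j] ~= scale[g][j] * Q[i][j] + zero[g][j].
    """
    d_in, d_out = len(W), len(W[0])
    n_groups = d_in // group_size
    levels = 2 ** bits - 1
    Q = [[0] * d_out for _ in range(d_in)]
    scale = [[0.0] * d_out for _ in range(n_groups)]
    zero = [[0.0] * d_out for _ in range(n_groups)]
    for g in range(n_groups):
        rows = range(g * group_size, (g + 1) * group_size)
        for j in range(d_out):
            lo = min(W[i][j] for i in rows)
            hi = max(W[i][j] for i in rows)
            scale[g][j] = (hi - lo) / levels or 1.0  # guard against a flat group
            zero[g][j] = lo
            for i in rows:
                Q[i][j] = round((W[i][j] - lo) / scale[g][j])
    return Q, scale, zero

def dequant_forward(x, Q, scale, zero, group_size):
    """y = x @ W_hat with W_hat[i][j] = scale[g][j] * Q[i][j] + zero[g][j]."""
    d_out = len(Q[0])
    y = [0.0] * d_out
    for i, xi in enumerate(x):
        g = i // group_size
        for j in range(d_out):
            y[j] += xi * (scale[g][j] * Q[i][j] + zero[g][j])
    return y

def adapter_forward(x, A, B, group_size):
    """Assumed group-wise branch: sum-pool x per group, then A (L x r), B (r x D_out)."""
    n_groups, rank, d_out = len(A), len(B), len(B[0])
    pooled = [sum(x[g * group_size:(g + 1) * group_size]) for g in range(n_groups)]
    hidden = [sum(pooled[g] * A[g][k] for g in range(n_groups)) for k in range(rank)]
    return [sum(hidden[k] * B[k][j] for k in range(rank)) for j in range(d_out)]

def merge_into_zero(zero, A, B):
    """Fold the adapter into the per-group zero points; integer codes are untouched."""
    rank = len(B)
    return [[zero[g][j] + sum(A[g][k] * B[k][j] for k in range(rank))
             for j in range(len(zero[0]))] for g in range(len(zero))]

def demo(d_in=8, d_out=3, group_size=4, rank=2, seed=0):
    """Return the largest gap between (quantized + adapter) and the merged model."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    W, x = rand(d_in, d_out), [rng.uniform(-1, 1) for _ in range(d_in)]
    A, B = rand(d_in // group_size, rank), rand(rank, d_out)
    Q, scale, zero = quantize_groupwise(W, group_size)
    base = dequant_forward(x, Q, scale, zero, group_size)
    side = adapter_forward(x, A, B, group_size)
    merged = dequant_forward(x, Q, scale, merge_into_zero(zero, A, B), group_size)
    return max(abs(b + s - m) for b, s, m in zip(base, side, merged))

if __name__ == "__main__":
    print(demo())  # gap should be at floating-point rounding level
```

In this sketch the merged model reuses the same INT4 codes and scales and changes only the zero points, which is one way to read the claim that the LLM and auxiliary weights integrate into a quantized model without loss of accuracy. Smaller groups give quantization more scale/zero pairs (more freedom), while pooling leaves the adapter with only one row per group (less freedom).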
