LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

Abstract

Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA (Low-Rank Adaptation) fine-tuning. In this work, we focus on the scenario where quantization and LoRA fine-tuning are applied together to a pre-trained model. In such cases, it is common to observe a consistent gap in downstream task performance between full fine-tuning and the quantization-plus-LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision models and significantly improves generalization in downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed-precision regimes. We will release our code.
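
The abstract does not spell out how the quantized weights and the low-rank initialization are obtained together, so the following is only a minimal sketch of the idea under stated assumptions, not the released implementation. It assumes a simple uniform quantizer as a stand-in for whatever quantizer is actually used, and an alternating loop that quantizes the residual and then fits a truncated SVD to what quantization missed. The names `uniform_quantize` and `low_rank_init`, and the parameters `bits`, `rank`, and `steps`, are hypothetical and chosen for illustration.

```python
import numpy as np


def uniform_quantize(w, bits):
    """Hypothetical stand-in quantizer: snap values to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    if hi == lo:
        return w.copy()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo


def low_rank_init(w, bits=2, rank=8, steps=5):
    """Assumed procedure: alternate between quantizing and fitting a low-rank correction.

    Returns a quantized matrix q and factors a, b such that q + a @ b.T
    approximates the full-precision weight w.
    """
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(steps):
        # Quantize what the current low-rank term does not already explain.
        q = uniform_quantize(w - a @ b.T, bits)
        # Best rank-`rank` approximation of the quantization residual.
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * np.sqrt(s[:rank])
        b = vt[:rank].T * np.sqrt(s[:rank])
    return q, a, b


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))  # toy "pre-trained" weight matrix

q, a, b = low_rank_init(w, bits=2, rank=8, steps=1)
err_quant_only = np.linalg.norm(w - uniform_quantize(w, 2))
err_with_init = np.linalg.norm(w - (q + a @ b.T))
print("quantization only:", err_quant_only)
print("quantization + low-rank init:", err_with_init)
```

With a single step, the low-rank term is the best rank-8 fit to the quantization residual, so the combined approximation cannot be worse than quantization alone. In this sketch, `a` and `b` would play the role of the initial LoRA adapter weights in place of the usual zero-product initialization.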
