LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
October 12, 2023
Authors: Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao
cs.AI
Abstract
Quantization is an indispensable technique for serving Large Language Models
(LLMs) and has recently found its way into LoRA fine-tuning. In this work we
focus on the scenario where quantization and LoRA fine-tuning are applied
together on a pre-trained model. In such cases, it is common to observe a
consistent gap in downstream-task performance between full fine-tuning
and the quantization-plus-LoRA fine-tuning approach. In response, we propose LoftQ
(LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that
simultaneously quantizes an LLM and finds a proper low-rank initialization for
LoRA fine-tuning. Such an initialization alleviates the discrepancy between the
quantized and full-precision model and significantly improves the
generalization in downstream tasks. We evaluate our method on natural language
understanding, question answering, summarization, and natural language
generation tasks. Experiments show that our method is highly effective and
outperforms existing quantization methods, especially in the challenging 2-bit
and 2/4-bit mixed precision regimes. We will release our code.
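
The abstract describes jointly quantizing a pre-trained weight matrix and finding a low-rank initialization for LoRA so that the quantized backbone plus the adapters stays close to the full-precision weights. The sketch below illustrates one way such an alternating initialization could look; it is not the authors' released implementation. The uniform quantizer is a toy stand-in for the low-bit quantizers used in practice, and the function names (`uniform_quantize`, `loftq_init`) are illustrative.

```python
# Minimal sketch, assuming a simple symmetric uniform quantizer as a placeholder
# for the paper's low-bit (e.g., 2-bit) quantization scheme.
import torch


def uniform_quantize(weight: torch.Tensor, num_bits: int = 2) -> torch.Tensor:
    """Toy symmetric uniform quantizer (placeholder, not the paper's quantizer)."""
    max_level = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max() / max(max_level, 1) + 1e-12
    return torch.round(weight / scale).clamp(-max_level, max_level) * scale


def loftq_init(weight: torch.Tensor, rank: int = 16, num_bits: int = 2, iters: int = 5):
    """Alternate between quantizing the residual and refitting rank-r factors
    so that Q + A @ B approximates the full-precision weight matrix."""
    A = torch.zeros(weight.shape[0], rank)
    B = torch.zeros(rank, weight.shape[1])
    for _ in range(iters):
        # Quantize the part the low-rank adapters do not yet capture.
        Q = uniform_quantize(weight - A @ B, num_bits)
        # Refit the low-rank factors to the remaining quantization error via SVD.
        U, S, Vh = torch.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]
        B = Vh[:rank, :]
    return Q, A, B  # Q initializes the frozen backbone; A, B initialize the LoRA adapters.


# Usage sketch: initialize one linear layer's quantized weight and LoRA factors.
W = torch.randn(1024, 1024)
Q, A, B = loftq_init(W, rank=16, num_bits=2)
```

In this sketch, the low-rank factors absorb part of the quantization error before fine-tuning begins, which is the stated mechanism for narrowing the gap between the quantized and full-precision model.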