
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

September 26, 2023
Authors: Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian
cs.AI

Abstract

Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
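
To make the group-wise idea concrete, below is a minimal PyTorch sketch (the class and parameter names, such as QALoRALinearSketch, are hypothetical and this is not the authors' released code). It illustrates the structure described in the abstract: the frozen weight is quantized with per-group scales and zero points (more quantization freedom), while the LoRA branch acts on inputs average-pooled within each quantization group (less adaptation freedom), so the low-rank update can later be folded into the per-group quantization parameters rather than requiring a separate full-precision merge.

```python
# A minimal, illustrative sketch of the group-wise operator behind QA-LoRA.
# Assumptions: weights are stored here as already-dequantized FP tensors with
# per-group scale/zero buffers, purely for readability; a real INT4 kernel
# would keep the packed integer weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QALoRALinearSketch(nn.Module):
    def __init__(self, in_features, out_features, rank=16, num_groups=32):
        super().__init__()
        assert in_features % num_groups == 0
        self.num_groups = num_groups
        self.group_size = in_features // num_groups

        # Frozen group-wise "quantized" weight: INT4-style codes plus
        # per-(output, group) scale and zero point.
        self.register_buffer("w_q", torch.randint(0, 16, (out_features, in_features)).float())
        self.register_buffer("scale", torch.rand(out_features, num_groups) * 0.01)
        self.register_buffer("zero", torch.randint(0, 16, (out_features, num_groups)).float())

        # LoRA factors: A maps the *group-pooled* input (num_groups dims) to rank.
        # This is the key difference from vanilla LoRA, whose A is rank x in_features.
        self.lora_A = nn.Parameter(torch.randn(rank, num_groups) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def dequantize(self):
        # w = scale * (w_q - zero), broadcast over each input group.
        s = self.scale.repeat_interleave(self.group_size, dim=1)
        z = self.zero.repeat_interleave(self.group_size, dim=1)
        return s * (self.w_q - z)

    def forward(self, x):
        # Base path uses the frozen (de)quantized weight.
        base = F.linear(x, self.dequantize())
        # Adapter path: average-pool x within each quantization group,
        # then apply the low-rank update B(Ax_pooled).
        x_pooled = x.view(*x.shape[:-1], self.num_groups, self.group_size).mean(-1)
        update = F.linear(F.linear(x_pooled, self.lora_A), self.lora_B)
        return base + update
```

Because every input dimension inside a group contributes the same pooled value to the adapter path, the learned update is constant within each group and can be absorbed into the corresponding per-group zero points after fine-tuning, which is how the adapter merges into a quantized model without an accuracy-losing re-quantization step (a reading of the abstract's claim, not a verified derivation).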