QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

September 26, 2023
Authors: Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian
cs.AI

Abstract

Recent years have witnessed the rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
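The group-wise idea described in the abstract can be illustrated with a short sketch: the pretrained weight is quantized group-wise along the input dimension, while the LoRA input projection A operates on group-averaged inputs, so the adapter shares the quantization granularity and its correction can be folded into the per-group zero points after fine-tuning. The PyTorch snippet below is a minimal sketch of that structure under stated assumptions, not the authors' implementation; the min-max INT4 fake quantizer, the class and function names, and hyperparameters such as rank and group_size are illustrative choices only.

```python
# Minimal sketch of a QA-LoRA-style linear layer (illustrative, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize_int4_groupwise(w: torch.Tensor, group_size: int) -> torch.Tensor:
    """Min-max fake-quantize W (out_features, in_features) to INT4 with a
    separate scale and zero point per (output row, input group)."""
    out_f, in_f = w.shape
    g = w.view(out_f, in_f // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0          # 16 INT4 levels
    q = torch.clamp(torch.round((g - w_min) / scale), 0, 15)
    return (q * scale + w_min).view(out_f, in_f)            # dequantized for simulation


class QALoRALinear(nn.Module):
    """Frozen group-wise quantized base weight plus a group-pooled LoRA adapter."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 16, group_size: int = 32):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        # Stand-in for a pretrained weight; in practice this comes from the LLM.
        w = torch.randn(out_features, in_features) * 0.02
        self.register_buffer("w_q", fake_quantize_int4_groupwise(w, group_size))
        n_groups = in_features // group_size
        # A acts on group-averaged inputs (n_groups dims) rather than all
        # in_features dims, so B @ A has one value per input group and can be
        # merged into the per-group zero points after fine-tuning.
        self.lora_A = nn.Parameter(torch.randn(rank, n_groups) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = F.linear(x, self.w_q)                          # frozen quantized path
        x_pooled = x.reshape(*x.shape[:-1], -1, self.group_size).mean(dim=-1)
        return base + F.linear(F.linear(x_pooled, self.lora_A), self.lora_B)


layer = QALoRALinear(in_features=128, out_features=64)
print(layer(torch.randn(2, 128)).shape)                       # torch.Size([2, 64])
```

Because the adapter's output depends on the input only through per-group averages, its contribution has the same granularity as the quantization groups, which is what allows the fine-tuned model to remain a single quantized model after merging.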