QLoRA: Efficient Finetuning of Quantized LLMs
May 23, 2023
Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
cs.AI
Abstract
We present QLoRA, an efficient finetuning approach that reduces memory usage
enough to finetune a 65B parameter model on a single 48GB GPU while preserving
full 16-bit finetuning task performance. QLoRA backpropagates gradients through
a frozen, 4-bit quantized pretrained language model into Low Rank
Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all
previous openly released models on the Vicuna benchmark, reaching 99.3% of the
performance level of ChatGPT while only requiring 24 hours of finetuning on a
single GPU. QLoRA introduces a number of innovations to save memory without
sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is
information-theoretically optimal for normally distributed weights, (b) double
quantization to reduce the average memory footprint by quantizing the
quantization constants, and (c) paged optimizers to manage memory spikes. We
use QLoRA to finetune more than 1,000 models, providing a detailed analysis of
instruction following and chatbot performance across 8 instruction datasets,
multiple model types (LLaMA, T5), and model scales that would be infeasible to
run with regular finetuning (e.g. 33B and 65B parameter models). Our results
show that QLoRA finetuning on a small high-quality dataset leads to
state-of-the-art results, even when using smaller models than the previous
SoTA. We provide a detailed analysis of chatbot performance based on both human
and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable
alternative to human evaluation. Furthermore, we find that current chatbot
benchmarks are not trustworthy to accurately evaluate the performance levels of
chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to
ChatGPT. We release all of our models and code, including CUDA kernels for
4-bit training.
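
To make the two memory-saving ideas concrete, here is a minimal pure-Python sketch of (a) quantile-based 4-bit quantization and (b) double quantization of the per-block constants. All function names are hypothetical, and the level construction is a deliberate simplification: the paper's actual NF4 code book is asymmetric and reserves an exact zero, and real implementations operate on tensors, not Python lists.

```python
from statistics import NormalDist

def nf4_levels(k=16):
    """Quantile-based quantization levels for N(0, 1), scaled to [-1, 1].
    Simplified stand-in for the paper's NF4 code book (the real NF4 is
    asymmetric and reserves an exact zero)."""
    nd = NormalDist()
    # midpoints of k equal-probability bins of the standard normal
    qs = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize_block(weights, levels):
    """Absmax-scale one weight block, then store nearest-level indices.
    The floating-point absmax is the per-block quantization constant."""
    absmax = max(abs(w) for w in weights) or 1.0
    idx = [min(range(len(levels)), key=lambda j: abs(w / absmax - levels[j]))
           for w in weights]
    return idx, absmax

def dequantize_block(idx, absmax, levels):
    """Recover approximate weights from 4-bit indices and the block constant."""
    return [levels[j] * absmax for j in idx]

def double_quantize_constants(absmaxes, bits=8):
    """Double quantization: quantize the per-block absmax constants
    themselves to `bits` bits with a single floating-point scale,
    shrinking the average memory overhead per weight."""
    scale = max(absmaxes) or 1.0
    steps = 2 ** bits - 1
    q = [round(a / scale * steps) for a in absmaxes]
    return q, scale
```

In this sketch, each block of weights costs 4 bits per weight plus one constant; double quantization then stores those constants in 8 bits each (plus one shared scale) instead of full precision, which is where the "average memory footprint" saving comes from.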