QLoRA: Efficient Finetuning of Quantized LLMs

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization, which reduces the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks cannot be trusted to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
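
To make the memory claim concrete, the following is a rough back-of-the-envelope sketch in Python of the storage needed for the frozen base weights alone. The 65B parameter count, the 16-bit baseline, and the 4-bit storage come from the abstract. The block sizes (64 weights per block, 256 constants per second-level block) and constant widths (32-bit, then 8-bit) are assumptions not stated in this abstract, and the function names are purely illustrative. The estimate ignores adapter weights, activations, gradients, and optimizer state.

```python
# Back-of-the-envelope estimate of base-weight memory under 4-bit quantization.
# Block sizes and constant widths below are ASSUMPTIONS for illustration only.


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def constant_overhead_bits(block_size: int = 64, const_bits: int = 32) -> float:
    """Extra bits per parameter when each block stores one quantization constant."""
    return const_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,    # weights per first-level block (assumed)
    inner_bits: int = 8,     # width of each quantized constant (assumed)
    outer_block: int = 256,  # constants per second-level block (assumed)
    outer_bits: int = 32,    # width of each second-level constant (assumed)
) -> float:
    """Extra bits per parameter when the constants are themselves quantized."""
    return inner_bits / block_size + outer_bits / (block_size * outer_block)


N_PARAMS = 65e9  # the 65B model mentioned in the abstract

fp16_gb = weight_memory_gb(N_PARAMS, 16)
plain_bits = constant_overhead_bits()
dq_bits = double_quant_overhead_bits()
nf4_gb = weight_memory_gb(N_PARAMS, 4 + plain_bits)
nf4_dq_gb = weight_memory_gb(N_PARAMS, 4 + dq_bits)

print(f"16-bit weights:               {fp16_gb:.1f} GB")
print(f"4-bit + 32-bit constants:     {nf4_gb:.1f} GB")
print(f"4-bit + double quantization:  {nf4_dq_gb:.1f} GB")
```

Under these assumptions, the 16-bit weights alone (130 GB) exceed a 48GB GPU, while the 4-bit weights come to roughly 33.5 GB with double quantization, leaving headroom for the adapters and training state.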
