QLoRA: Efficient Finetuning of Quantized LLMs

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization, which reduces the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g., 33B and 65B parameter models). Our results show that QLoRA finetuning on a small, high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks cannot be trusted to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
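To give a rough sense of why 4-bit storage and double quantization matter at the 65B scale, the following back-of-the-envelope sketch estimates the memory taken by the frozen base weights alone. It is an illustration under stated assumptions, not the released implementation: the block sizes (64 weights per first-level block, 256 constants per second-level block), the 32-bit and 8-bit widths of the quantization constants, and all function names are assumptions that do not appear in the abstract. Activations, adapter weights, gradients, and optimizer state are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def constant_overhead_bits(block_size: int, constant_bits: int = 32) -> float:
    """Extra bits per parameter when every block stores one scaling constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int,
    second_block_size: int,
    quantized_constant_bits: int = 8,
    second_constant_bits: int = 32,
) -> float:
    """Extra bits per parameter when the first-level constants are themselves
    quantized in blocks, each with its own second-level constant."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


N_PARAMS = 65e9          # model size named in the abstract
BLOCK_SIZE = 64          # assumed first-level block size
SECOND_BLOCK_SIZE = 256  # assumed second-level block size

fp16_gb = weight_memory_gb(N_PARAMS, 16)
plain_bits = 4 + constant_overhead_bits(BLOCK_SIZE)
double_bits = 4 + double_quant_overhead_bits(BLOCK_SIZE, SECOND_BLOCK_SIZE)

print(f"16-bit weights:              {fp16_gb:.1f} GB")
print(f"4-bit + 32-bit constants:    {weight_memory_gb(N_PARAMS, plain_bits):.1f} GB")
print(f"4-bit + double quantization: {weight_memory_gb(N_PARAMS, double_bits):.1f} GB")
```

Under these assumptions, the 16-bit weights alone exceed a 48GB GPU several times over, while the 4-bit weights leave headroom for adapters and activations. Quantizing the constants removes most of the per-block overhead.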
