Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
May 19, 2025
作者: Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou
cs.AI
Abstract
As the size of large language models grows exponentially, GPU memory has
become a bottleneck for adapting these models to downstream tasks. In this
paper, we aim to push the limits of memory-efficient training by minimizing
memory usage on model weights, gradients, and optimizer states, within a
unified framework. Our idea is to eliminate both gradients and optimizer states
using zeroth-order optimization, which approximates gradients by perturbing
weights during forward passes to identify gradient directions. To minimize
memory usage on weights, we employ model quantization, e.g., converting from
bfloat16 to int4. However, directly applying zeroth-order optimization to
quantized weights is infeasible due to the precision gap between discrete
weights and continuous gradients, which would otherwise require de-quantization
and re-quantization. To overcome this challenge, we propose Quantized
Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous
quantization scale for gradient estimation and uses a directional derivative
clipping method to stabilize training. QZO is orthogonal to both scalar-based
and codebook-based post-training quantization methods. Compared to
full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by
more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B and
Stable Diffusion 3.5 Large on a single 24GB GPU.
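At a high level, the abstract describes an SPSA-style zeroth-order update: estimate a directional derivative from two perturbed forward passes, clip it for stability, and apply the perturbation to the continuous quantization scales rather than to the discrete int4 weights. The sketch below illustrates that idea only; `model_forward`, the hyperparameters, and the clipping threshold are hypothetical placeholders, not the paper's implementation.

```python
import torch

# Minimal sketch (not the authors' code) of a QZO-style zeroth-order update.
# Assumed interface: `model_forward(scales)` dequantizes the frozen int4 weights
# with the given continuous quantization scales, runs a forward pass on a
# minibatch, and returns a scalar loss tensor.

def qzo_step(scales: torch.Tensor, model_forward, lr=1e-6, eps=1e-3, clip=1.0, seed=0):
    """One SPSA-style update on the quantization scales; the weights stay int4."""
    gen = torch.Generator().manual_seed(seed)
    # Random perturbation direction; it can be regenerated from the seed later,
    # so no gradient buffer or optimizer state needs to be kept in memory.
    z = torch.randn(scales.shape, generator=gen)

    loss_plus = model_forward(scales + eps * z)    # forward pass at +eps * z
    loss_minus = model_forward(scales - eps * z)   # forward pass at -eps * z

    # Finite-difference estimate of the directional derivative along z.
    d = float(loss_plus - loss_minus) / (2 * eps)
    # Clip the directional derivative to stabilize training (threshold is a guess).
    d = max(-clip, min(clip, d))

    # SGD-style update applied only to the continuous scales.
    return scales - lr * d * z
```

Because the update needs only forward passes and a reproducible random direction, the memory footprint reduces to the quantized weights plus the scales, which is the source of the reported savings over bfloat16 full-parameter fine-tuning.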