Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
May 19, 2025
Authors: Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou
cs.AI
Abstract
As the size of large language models grows exponentially, GPU memory has
become a bottleneck for adapting these models to downstream tasks. In this
paper, we aim to push the limits of memory-efficient training by minimizing
memory usage on model weights, gradients, and optimizer states, within a
unified framework. Our idea is to eliminate both gradients and optimizer states
using zeroth-order optimization, which approximates gradients by perturbing
weights during forward passes to identify gradient directions. To minimize
memory usage on weights, we employ model quantization, e.g., converting from
bfloat16 to int4. However, directly applying zeroth-order optimization to
quantized weights is infeasible due to the precision gap between discrete
weights and continuous gradients, which would otherwise require de-quantization
and re-quantization. To overcome this challenge, we propose Quantized
Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous
quantization scale for gradient estimation and uses a directional derivative
clipping method to stabilize training. QZO is orthogonal to both scalar-based
and codebook-based post-training quantization methods. Compared to
full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by
more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B and
Stable Diffusion 3.5 Large on a single 24GB GPU.
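The zeroth-order estimation the abstract describes can be illustrated with a minimal NumPy sketch: estimate the directional derivative from two forward passes along a random perturbation, clip it, and step along that direction. This is a toy illustration on raw parameters with a synthetic quadratic loss, not the paper's method; QZO instead perturbs the continuous quantization scales of quantized weights, and all function names and hyperparameters below are illustrative.

```python
import numpy as np

def zo_step(w, loss_fn, lr=5e-3, eps=1e-3, clip=10.0, rng=None):
    """One zeroth-order (SPSA-style) update: two forward passes along a
    random direction z give the directional derivative; the gradient
    estimate is that scalar times z. No backward pass is ever run."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(w.shape)              # random perturbation direction
    d = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)
    d = float(np.clip(d, -clip, clip))            # directional-derivative clipping
    return w - lr * d * z                         # step along the estimate d * z

# Toy demo: minimize ||w - target||^2 using only forward evaluations.
target = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum((w - target) ** 2))

w = np.zeros(3)
rng = np.random.default_rng(42)
for _ in range(2000):
    w = zo_step(w, loss, rng=rng)
```

Because only forward passes and a scalar per step are needed, no gradients or optimizer moments are stored, which is what removes two of the three memory terms the abstract lists.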