Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

Abstract

As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing the memory used by model weights, gradients, and optimizer states within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize the memory used by weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible because of the precision gap between discrete weights and continuous gradients; bridging that gap would require de-quantizing and re-quantizing the weights. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large on a single 24 GB GPU.
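To make the idea of perturbing the continuous quantization scale concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. It assumes a two-point (SPSA-style) estimator, one scale per weight, a small linear model with a squared-error loss, and hypothetical names and hyperparameters (`qzo_step`, `eps`, `lr`, `clip`); the abstract does not specify any of these. The integer codes are never modified: only the scales are perturbed and updated, and the estimated directional derivative is clipped before it is used.

```python
import random


def loss(scales, codes, xs, ys):
    """Mean squared error of a linear model with weights w_i = scale_i * code_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(s * q * xi for s, q, xi in zip(scales, codes, x))
        total += (pred - y) ** 2
    return total / len(xs)


def qzo_step(scales, codes, xs, ys, rng, eps=1e-3, lr=2e-3, clip=1.0):
    """One zeroth-order update of the scales; the integer codes stay frozen.

    Assumption: a two-point estimator with Gaussian perturbations.
    """
    # Random direction in scale space.
    z = [rng.gauss(0.0, 1.0) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    # Two forward passes give a finite-difference directional derivative.
    d = (loss(plus, codes, xs, ys) - loss(minus, codes, xs, ys)) / (2 * eps)
    # Directional derivative clipping: bound the step size along z.
    d = max(-clip, min(clip, d))
    new_scales = [s - lr * d * zi for s, zi in zip(scales, z)]
    return new_scales, d


def run_demo(steps=3000, seed=0):
    """Fit the scales of a 4-weight toy model and return (initial, final) loss."""
    rng = random.Random(seed)
    codes = [3, -5, 7, -2]                  # frozen int4-style codes
    target_scales = [0.1, 0.2, 0.05, 0.3]   # scales that generated the data
    xs = [[rng.uniform(-1.0, 1.0) for _ in codes] for _ in range(32)]
    ys = [sum(s * q * xi for s, q, xi in zip(target_scales, codes, x)) for x in xs]
    scales = [0.15, 0.15, 0.1, 0.2]         # deliberately wrong starting scales
    initial = loss(scales, codes, xs, ys)
    for _ in range(steps):
        scales, _d = qzo_step(scales, codes, xs, ys, rng)
    return initial, loss(scales, codes, xs, ys)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.6f}, after: {final:.6f}")
```

Because each step needs only two forward passes and a single scalar `d`, no gradient tensors or optimizer states are stored, which is the source of the memory savings described above. The per-weight scale is a simplification; practical quantizers typically share a scale across a group or channel.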
