Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

Abstract

As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing the memory used by model weights, gradients, and optimizer states within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize the memory used by weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible because of the precision gap between discrete weights and continuous gradients; bridging that gap would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by more than 18× for 4-bit LLMs, and it enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large on a single 24 GB GPU.
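The idea of perturbing only the continuous quantization scale can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a tiny linear model, one scale per weight (rather than per group or per codebook), a two-point SPSA-style estimator, and hypothetical names and hyperparameters (`zo_scale_step`, `run_demo`, `lr`, `eps`, `clip`). It only shows the mechanics described above: the integer codes stay frozen, two forward passes give a directional derivative, and that scalar is clipped before the update.

```python
import random


def loss(scales, codes, xs, ys):
    """Mean squared error of a linear model with weights w_i = scale_i * code_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(s * q * xi for s, q, xi in zip(scales, codes, x))
        total += (pred - y) ** 2
    return total / len(xs)


def zo_scale_step(scales, codes, xs, ys, rng, lr=1e-3, eps=1e-3, clip=1.0):
    """One zeroth-order update of the continuous scales; integer codes are never touched."""
    # Random perturbation direction for the scales (assumption: standard Gaussian).
    z = [rng.gauss(0.0, 1.0) for _ in scales]
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    # Two forward passes give a finite-difference estimate of the directional derivative.
    d = (loss(plus, codes, xs, ys) - loss(minus, codes, xs, ys)) / (2.0 * eps)
    # Clip the directional derivative so one noisy estimate cannot cause a huge step.
    d = max(-clip, min(clip, d))
    # Move the scales along z, weighted by the clipped directional derivative.
    return [s - lr * d * zi for s, zi in zip(scales, z)]


def run_demo(steps=500, seed=0):
    """Fit the scales of a 4-weight toy model; returns (initial_loss, final_loss)."""
    rng = random.Random(seed)
    codes = [3, -2, 5, 1]                # frozen integer codes (within the int4 range)
    true_scales = [0.5, 0.25, 0.1, 0.8]  # hypothetical target scales for the toy task
    xs = [[rng.uniform(-1.0, 1.0) for _ in codes] for _ in range(32)]
    ys = [sum(s * q * xi for s, q, xi in zip(true_scales, codes, x)) for x in xs]
    scales = [0.4] * len(codes)          # starting scales, deliberately off target
    initial = loss(scales, codes, xs, ys)
    for _ in range(steps):
        scales = zo_scale_step(scales, codes, xs, ys, rng)
    return initial, loss(scales, codes, xs, ys)


if __name__ == "__main__":
    initial_loss, final_loss = run_demo()
    print("loss before:", initial_loss, "loss after:", final_loss)
```

No backward pass and no optimizer state appear anywhere in the loop, which is where the memory saving comes from. This toy version stores the perturbation `z` explicitly for readability; a memory-conscious implementation would likely avoid keeping a full copy of it.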
