Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

Abstract

Training Large Language Models (LLMs) is memory-intensive because of the large number of parameters and the associated optimizer states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates add significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that combines quantization with low-rank projection to reduce memory usage substantially, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others change frequently; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and the weights in INT8 format, and we apply stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. In pre-training, Q-GaLore makes it possible to train a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory. In fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
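To illustrate why stochastic rounding can capture accumulated gradient information in low-precision weights, the following is a minimal scalar sketch, not the implementation used in Q-GaLore. It assumes a single integer-valued weight with a unit quantization step and a constant update of 0.01 per step. The names `stochastic_round` and `accumulate` are hypothetical and chosen only for this example.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower  # in [0, 1): probability of rounding up
    return lower + (1 if rng.random() < frac else 0)


def accumulate(update: float, steps: int, rounder) -> int:
    """Apply `steps` small updates to a weight that is stored as an integer."""
    w = 0
    for _ in range(steps):
        # The weight is re-quantized after every update (toy assumption).
        w = rounder(w + update)
    return w


rng = random.Random(0)  # fixed seed so the run is reproducible
steps, update = 10_000, 0.01  # each update is far below one quantization step

# Round-to-nearest discards every update, so the weight never moves.
nearest = accumulate(update, steps, round)

# Stochastic rounding keeps the updates in expectation (about steps * update).
stochastic = accumulate(update, steps, lambda v: stochastic_round(v, rng))

print("round-to-nearest:", nearest)
print("stochastic rounding:", stochastic)
```

With round-to-nearest the weight stays at 0, since each update is smaller than half a quantization step. With stochastic rounding the weight drifts toward the full-precision total of 100, up to sampling noise. A real INT8 scheme would also carry per-tensor or per-block scale factors, which this sketch leaves out.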
