Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

Abstract

Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and the associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and the weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. In pre-training, Q-GaLore makes it possible to train a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory. In fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
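To illustrate why stochastic rounding can help low-precision weights retain accumulated gradient information, the following is a minimal sketch and not the Q-GaLore implementation. It assumes NumPy, uses a unit-spaced grid as a stand-in for quantized weight levels, and introduces a hypothetical helper `stochastic_round`. The update size of 0.01 per step is an arbitrary illustrative value chosen to be far smaller than the grid spacing.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each element down or up at random; the probability of
    rounding up equals the fractional part, so E[result] == x."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)

# Two copies of the same "weights", both stored on an integer grid.
w_nearest = np.zeros(10_000)
w_stoch = np.zeros(10_000)

update = 0.01   # illustrative update, much smaller than the grid spacing of 1
steps = 100     # exact accumulated change would be steps * update = 1.0

for _ in range(steps):
    # Round-to-nearest discards every sub-grid update.
    w_nearest = np.rint(w_nearest + update)
    # Stochastic rounding keeps the update in expectation.
    w_stoch = stochastic_round(w_stoch + update, rng)

print("round-to-nearest mean:", w_nearest.mean())
print("stochastic rounding mean:", w_stoch.mean())
```

In this toy setting round-to-nearest leaves every weight at zero because each individual update is below half a grid step, whereas the stochastically rounded weights track the exact accumulated change on average. Individual entries remain noisy, which is the price paid for the unbiased estimate.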
