
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

July 11, 2024
Authors: Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
cs.AI

Abstract

Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and the weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. In pre-training, Q-GaLore enables training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory. In fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
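
To make the stochastic-rounding component concrete, the sketch below shows one way an INT8 weight can absorb a small high-precision update: rounding up or down with probability equal to the fractional part is unbiased, so gradient contributions smaller than one quantization step survive in expectation. This is a minimal PyTorch illustration, not the authors' implementation; the function name, the per-tensor scale, and the tensor shapes are assumptions made for the example.

import torch

def stochastic_round_to_int8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map to the INT8 grid, then round up with probability equal to the
    # fractional part so the quantized value is unbiased in expectation.
    x_scaled = x / scale
    floor = torch.floor(x_scaled)
    prob_up = x_scaled - floor
    rounded = floor + (torch.rand_like(x_scaled) < prob_up).float()
    return rounded.clamp(-128, 127).to(torch.int8)

# Toy update step: dequantize INT8 weights, apply a float update that is
# much smaller than one INT8 step, and re-quantize with stochastic rounding.
scale = torch.tensor(0.05)
w_int8 = torch.zeros(4, dtype=torch.int8)
update = torch.tensor([0.010, -0.020, 0.015, 0.004])
w_float = w_int8.float() * scale + update
w_int8 = stochastic_round_to_int8(w_float, scale)

With round-to-nearest, every entry of this update would be discarded (each is below half a quantization step of 0.025); with stochastic rounding, each weight moves by one step with the corresponding probability, so repeated small updates accumulate on average.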
