

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

July 11, 2024
作者: Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
cs.AI

Abstract

Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. At pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB memory. At fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
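The two mechanisms highlighted in the abstract (layer-adaptive lazy subspace updates and stochastic rounding for low-precision weight storage) can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration, not the authors' released implementation: the helper names (`stochastic_round_to_int8`, `LazySubspace`), the similarity threshold, and the interval-doubling schedule are assumptions made for exposition, and INT4 quantization of the projection matrices is omitted for brevity.

```python
# Sketch only: illustrative thresholds and schedule, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def stochastic_round_to_int8(x_fp32: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize to INT8, rounding up with probability equal to the fractional part,
    so updates smaller than one quantization step survive in expectation."""
    x = x_fp32 / scale
    lower = torch.floor(x)
    frac = x - lower
    rounded = lower + (torch.rand_like(x) < frac).to(x.dtype)
    return rounded.clamp_(-128, 127).to(torch.int8)

def update_int8_weight(w_int8: torch.Tensor, scale: float,
                       update_fp32: torch.Tensor) -> torch.Tensor:
    """Dequantize, apply a full-precision update, re-quantize stochastically."""
    w_fp32 = w_int8.to(torch.float32) * scale
    return stochastic_round_to_int8(w_fp32 + update_fp32, scale)

class LazySubspace:
    """Recompute a layer's low-rank projection by SVD only at an interval that
    grows once successive projections stop changing (illustrative schedule)."""
    def __init__(self, rank: int, base_interval: int = 200,
                 sim_threshold: float = 0.98, max_interval: int = 1600):
        self.rank = rank
        self.interval = base_interval
        self.sim_threshold = sim_threshold
        self.max_interval = max_interval
        self.P = None   # current projection, shape (d, rank)
        self.step = 0

    def maybe_update(self, grad: torch.Tensor) -> torch.Tensor:
        if self.P is None or self.step % self.interval == 0:
            U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            P_new = U[:, :self.rank]
            if self.P is not None:
                # average |cosine| between corresponding columns of old/new projections
                sim = F.cosine_similarity(self.P, P_new, dim=0).abs().mean()
                if sim > self.sim_threshold:
                    # subspace looks converged for this layer: stretch the interval
                    self.interval = min(self.interval * 2, self.max_interval)
            self.P = P_new
        self.step += 1
        return self.P
```

The rationale mirrors the abstract: a deterministic round would discard weight updates smaller than half an INT8 step, whereas stochastic rounding preserves them in expectation, which is what allows a high-precision training trajectory with only low-precision stored weights; and stretching the per-layer SVD interval once successive projections agree closely is one way to exploit the observation that some layers' gradient subspaces converge early, reducing the number of SVD operations.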

