Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
June 25, 2024
Authors: Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, Virginia Smith
cs.AI
Abstract
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B-parameter LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous methods, and yields up to a 2× throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS.