Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
June 25, 2024
Authors: Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, Virginia Smith
cs.AI
Abstract
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B-parameter LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous methods, and yields up to a 2× throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS.