Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
June 25, 2024
Authors: Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, Virginia Smith
cs.AI
Abstract
Large language model (LLM) training and finetuning are often bottlenecked by
limited GPU memory. While existing projection-based optimization methods
address this by projecting gradients into a lower-dimensional subspace to
reduce optimizer state memory, they typically rely on dense projection
matrices, which can introduce computational and memory overheads. In this work,
we propose Grass (GRAdient Structured Sparsification), a novel approach that
leverages sparse projections to transform gradients into structured sparse
updates. This design not only significantly reduces memory usage for optimizer
states but also minimizes gradient memory footprint, computation, and
communication costs, leading to substantial throughput improvements. Extensive
experiments on pretraining and finetuning tasks demonstrate that Grass achieves
performance competitive with full-rank training and existing projection-based
methods. Notably, Grass enables half-precision pretraining of a 13B parameter
LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous
methods, and yields up to a 2× throughput improvement on an 8-GPU
system. Code can be found at https://github.com/aashiqmuhamed/GRASS.
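
The idea of a sparse projection can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the projection simply keeps a subset of gradient rows, that rows are chosen by largest L2 norm, and that an Adam-style optimizer is used. The function names (`select_rows`, `sparse_project`, `sparse_adam_step`) and all hyperparameters are hypothetical. The point it shows is that the optimizer state has the compressed shape (r, n) rather than (m, n), and that the resulting weight update touches only the selected rows.

```python
import numpy as np

def select_rows(grad, r):
    """Pick the r rows of `grad` with the largest L2 norm.
    Assumption: norm-based selection is only one possible rule."""
    norms = np.linalg.norm(grad, axis=1)
    return np.sort(np.argsort(norms)[-r:])

def sparse_project(grad, rows):
    """Compress an (m, n) gradient to (r, n) by keeping only the chosen rows.
    Equivalent to multiplying by a 0/1 selection matrix, without building it."""
    return grad[rows]

def sparse_adam_step(weight, grad, rows, state,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step whose moment estimates live in the compressed space."""
    g = sparse_project(grad, rows)
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Structured sparse update: only the selected rows of the weight change.
    weight[rows] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return weight

rng = np.random.default_rng(0)
m, n, r = 8, 4, 2                      # toy sizes, chosen for illustration
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
rows = select_rows(G, r)
state = {"t": 0, "m": np.zeros((r, n)), "v": np.zeros((r, n))}
W_before = W.copy()
sparse_adam_step(W, G, rows, state)
changed = np.flatnonzero(np.any(W != W_before, axis=1))
```

In this toy setting the two moment buffers hold r × n values each instead of m × n, which is the kind of optimizer-state saving the abstract refers to. A dense projection would additionally need to store and multiply by an (r, m) matrix, whereas row selection needs only the r indices.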