Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

Abstract

Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. Existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, but they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous methods, and yields up to a 2× throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS.
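To make the idea of a sparse projection concrete, the following is a minimal sketch, not the authors' implementation. It assumes the projection simply selects r of the m rows of a gradient matrix, and it picks those rows by largest norm; the abstract does not specify the selection rule or any rescaling, so both choices, along with the helper names `top_r_rows`, `project`, and `project_back`, are illustrative assumptions.

```python
import random

def top_r_rows(grad, r):
    """Indices of the r rows with the largest squared L2 norm (assumed rule)."""
    norms = [sum(x * x for x in row) for row in grad]
    best = sorted(range(len(grad)), key=lambda i: norms[i], reverse=True)[:r]
    return sorted(best)

def project(grad, rows):
    """Sparse projection: keep only the selected rows, giving an r x n matrix."""
    return [grad[i][:] for i in rows]

def project_back(update, rows, m):
    """Scatter an r x n update into an m x n matrix that is zero elsewhere."""
    n = len(update[0])
    full = [[0.0] * n for _ in range(m)]
    for i, row in zip(rows, update):
        full[i] = row[:]
    return full

random.seed(0)
m, n, r = 6, 4, 2  # toy sizes; real weight matrices are far larger
grad = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

rows = top_r_rows(grad, r)
compressed = project(grad, rows)                   # what the optimizer would see
sparse_update = project_back(compressed, rows, m)  # structured sparse update

# Adam keeps two moment buffers per tracked entry.
adam_full = 2 * m * n       # 48 entries for the full gradient
adam_projected = 2 * r * n  # 16 entries for the projected gradient
print(rows, adam_full, adam_projected)
```

Because the projection here is a row selection rather than a dense matrix product, projecting costs an index lookup, and the update written back to the weights touches only r rows. This is one way to read the abstract's claim that sparse projections reduce gradient memory, computation, and communication in addition to optimizer state.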
