APOLLO: SGD-like Memory, AdamW-level Performance

Abstract

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) substantial optimizer memory overhead that is still required to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating AdamW's optimizer states. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
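The abstract describes the core idea only at a high level: keep AdamW-style moment estimates in a small, randomly projected space, and use them to derive a coarse (structured) learning rate scaling for the full gradient. The following NumPy code is a minimal sketch of one way to read that idea, not the authors' implementation. The function names, the per-column norm-ratio scaling, the projection distribution, and all hyperparameter defaults are assumptions made for illustration.

```python
import numpy as np


def apollo_like_step(W, G, state, lr=1e-3, rank=4,
                     betas=(0.9, 0.999), eps=1e-8, seed=0):
    """One hypothetical update for a weight matrix W (m x n) with gradient G.

    Assumption: the structured learning rate is one scale per column,
    estimated from Adam-style moments kept only in a rank-`rank` space.
    """
    m, n = G.shape
    b1, b2 = betas

    # Random projection, regenerated from a seed so it need not be stored.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((rank, m)) / np.sqrt(rank)
    R = P @ G  # projected gradient, shape (rank, n)

    # Low-rank auxiliary optimizer state: 2 * rank * n values in total.
    if not state:
        state["m"] = np.zeros_like(R)
        state["v"] = np.zeros_like(R)
        state["t"] = 0
    state["t"] += 1
    t = state["t"]
    state["m"] = b1 * state["m"] + (1 - b1) * R
    state["v"] = b2 * state["v"] + (1 - b2) * R ** 2

    # Bias-corrected Adam-style update, computed in the projected space.
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    R_adapted = m_hat / (np.sqrt(v_hat) + eps)

    # Per-column scale: how much the adaptive rule stretched each column.
    scale = np.linalg.norm(R_adapted, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Apply the coarse scaling to the full-rank gradient (broadcast over rows).
    return W - lr * G * scale


def adamw_state_size(m, n):
    """Number of optimizer-state values AdamW keeps for an m x n matrix."""
    return 2 * m * n


def low_rank_state_size(rank, n):
    """Number of optimizer-state values kept by the sketch above."""
    return 2 * rank * n
```

Under these assumptions, a 4096 x 4096 weight matrix needs 2 * 4096 * 4096 state values with AdamW, while the sketch with rank 1 keeps only 2 * 4096. That is the kind of gap behind the "SGD-like memory" claim, although the exact accounting in the paper may differ.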

