AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, there is a growing tendency to process more tokens at larger training scales but with relatively smaller model sizes. The Zero Redundancy Optimizer (ZeRO), although effective in conventional training environments, runs into scaling challenges under this emerging paradigm. To address this, we propose AMSP, a novel LLM training framework that partitions model states at a fine granularity, covering parameters (P), gradients (G), and optimizer states (OS). Specifically, AMSP (1) builds a unified partitioning space, enabling independent partitioning strategies for P, G, and OS; (2) incorporates a scale-aware partitioner to autonomously search for optimal partitioning strategies; and (3) designs a dedicated communication optimizer to manage the data placement discrepancies that arise from diverse partitioning strategies. Our evaluations show that AMSP achieves up to 90.3% scaling efficiency across 1024 GPUs.
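To make the idea of independent partitioning strategies for P, G, and OS concrete, the following is a minimal sketch, not AMSP's actual partitioner. It assumes mixed-precision Adam with the byte accounting commonly used for ZeRO (2 bytes per parameter for fp16 parameters, 2 for fp16 gradients, 12 for fp32 optimizer states). The function names, the power-of-two shard degrees, and the "least sharded first" selection rule are all hypothetical choices made for illustration.

```python
from itertools import product

# Assumed bytes per parameter for mixed-precision Adam:
# fp16 parameters, fp16 gradients, fp32 optimizer states.
BYTES_P, BYTES_G, BYTES_OS = 2, 2, 12


def per_gpu_bytes(num_params, shard_p, shard_g, shard_os):
    """Model-state bytes held by one GPU when P, G, and OS are each
    split across their own (independent) number of shards."""
    return num_params * (
        BYTES_P / shard_p + BYTES_G / shard_g + BYTES_OS / shard_os
    )


def candidate_strategies(world_size):
    """Enumerate (shard_p, shard_g, shard_os) triples whose degrees are
    powers of two that divide world_size (an assumption of this sketch)."""
    degrees = [
        d
        for d in (2 ** k for k in range(world_size.bit_length()))
        if world_size % d == 0
    ]
    return list(product(degrees, repeat=3))


def least_sharded_fit(num_params, world_size, budget_bytes):
    """Pick the strategy that fits the memory budget while sharding as
    little as possible, preferring to keep P, then G, then OS unsharded.
    This ordering is a crude stand-in for a real communication cost model."""
    fits = [
        s
        for s in candidate_strategies(world_size)
        if per_gpu_bytes(num_params, *s) <= budget_bytes
    ]
    return min(fits, default=None)
```

For a 1-billion-parameter model on 8 GPUs with a 6 GB budget for model states, `least_sharded_fit(1_000_000_000, 8, 6e9)` returns `(1, 1, 8)`: only the optimizer states are sharded, which resembles ZeRO stage 1. Tightening the budget forces gradients and parameters to be sharded as well, each to its own degree.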

AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning
