AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning
November 1, 2023
Authors: Qiaoling Chen, Qinghao Hu, Zhisheng Ye, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated impressive performance across
various downstream tasks. When training these models, there is a growing
inclination to process more tokens on larger training scales but with
relatively smaller model sizes. Zero Redundancy Optimizer (ZeRO), although
effective in conventional training environments, grapples with scaling
challenges when confronted with this emerging paradigm. To this end, we propose
a novel LLM training framework AMSP, which undertakes a granular partitioning
of model states, encompassing parameters (P), gradient (G), and optimizer
states (OS). Specifically, AMSP (1) builds a unified partitioning space,
enabling independent partitioning strategies for P, G, and OS; (2)
incorporates a scale-aware partitioner to autonomously search for optimal
partitioning strategies; (3) designs a dedicated communication optimizer to
ensure proficient management of data placement discrepancies arising from
diverse partitioning strategies. Our evaluations show that AMSP achieves up to
90.3% scaling efficiency across 1024 GPUs.
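To make the partitioning idea concrete, here is a minimal, illustrative sketch (not the paper's actual algorithm or API) of what a unified partitioning space and a scale-aware search could look like: each model-state component (P, G, OS) gets its own sharding degree, and a simple search picks the lowest-cost strategy that fits a per-GPU memory budget. The Strategy class, the toy cost model in memory_per_gpu_gb and comm_cost, and the nesting constraint are all assumptions introduced for illustration.

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Strategy:
    """Independent sharding degrees for the three model states (illustrative)."""
    shard_p: int   # parameter partition degree
    shard_g: int   # gradient partition degree
    shard_os: int  # optimizer-state partition degree

def memory_per_gpu_gb(num_params_b: float, s: Strategy) -> float:
    # Mixed-precision Adam footprint in bytes per parameter:
    # 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 master copy + moments).
    # num_params_b is in billions, so the result is roughly GB per GPU.
    return num_params_b * (2 / s.shard_p + 2 / s.shard_g + 12 / s.shard_os)

def comm_cost(s: Strategy) -> float:
    # Toy proxy (assumption): deeper sharding of P and G implies more
    # gather/scatter traffic per step; OS sharding is comparatively cheap.
    return s.shard_p + s.shard_g + 0.1 * s.shard_os

def search(num_params_b: float, world_size: int, mem_budget_gb: float) -> Strategy:
    # Candidate degrees are divisors of the world size; the nesting
    # constraint P <= G <= OS mirrors the ZeRO-style convention that
    # optimizer states are sharded at least as finely as gradients,
    # and gradients at least as finely as parameters (assumption).
    degrees = [d for d in range(1, world_size + 1) if world_size % d == 0]
    best, best_cost = None, float("inf")
    for sp, sg, so in product(degrees, repeat=3):
        if not (sp <= sg <= so):
            continue
        s = Strategy(sp, sg, so)
        if memory_per_gpu_gb(num_params_b, s) > mem_budget_gb:
            continue
        cost = comm_cost(s)
        if cost < best_cost:
            best, best_cost = s, cost
    if best is None:
        raise ValueError("no strategy fits the memory budget")
    return best

if __name__ == "__main__":
    # Example: a 7B-parameter model on 1024 GPUs with an 80 GB budget each.
    print(search(num_params_b=7, world_size=1024, mem_budget_gb=80))

The scale-aware partitioner described in the abstract would presumably rely on far more faithful memory and communication models at scale; the sketch above only illustrates how decoupling the three sharding degrees opens a search space beyond the fixed ZeRO stages.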