
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning

November 1, 2023
Authors: Qiaoling Chen, Qinghao Hu, Zhisheng Ye, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, there is a growing inclination to process more tokens on larger training scales but with relatively smaller model sizes. Zero Redundancy Optimizer (ZeRO), although effective in conventional training environments, grapples with scaling challenges when confronted with this emerging paradigm. To this end, we propose a novel LLM training framework, AMSP, which undertakes a granular partitioning of model states, encompassing parameters (P), gradients (G), and optimizer states (OS). Specifically, AMSP: (1) builds a unified partitioning space, enabling independent partitioning strategies for P, G, and OS; (2) incorporates a scale-aware partitioner to autonomously search for optimal partitioning strategies; and (3) designs a dedicated communication optimizer to ensure proficient management of data placement discrepancies arising from diverse partitioning strategies. Our evaluations show that AMSP achieves up to 90.3% scaling efficiency across 1024 GPUs.
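
To make the idea of independent sharding degrees for P, G, and OS more concrete, below is a minimal, hedged sketch of such a "unified partitioning space" and a brute-force scale-aware search over it. All names (Strategy, memory_per_gpu, comm_proxy, search_strategy) and the memory/communication cost model are illustrative assumptions for exposition, not AMSP's actual API or cost formulation.

```python
# Hedged sketch only: a toy unified partitioning space in the spirit of the
# abstract above. The cost model and names are assumptions, not the paper's.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Strategy:
    p_shards: int    # sharding degree for parameters (P)
    g_shards: int    # sharding degree for gradients (G)
    os_shards: int   # sharding degree for optimizer states (OS)


def memory_per_gpu(n_params: float, s: Strategy) -> float:
    """Approximate bytes/GPU: fp16 params and grads, ~12 B/param Adam states."""
    return (2 * n_params / s.p_shards +
            2 * n_params / s.g_shards +
            12 * n_params / s.os_shards)


def comm_proxy(s: Strategy) -> float:
    """Crude stand-in for collective-communication volume: finer sharding of
    P and G costs more per step than sharding OS, which is touched less often."""
    return 2.0 * s.p_shards + 2.0 * s.g_shards + 0.5 * s.os_shards


def search_strategy(n_params: float, world_size: int, mem_budget: float) -> Strategy:
    """Brute-force 'scale-aware' search: among strategies that fit the per-GPU
    memory budget, pick the one with the smallest communication proxy."""
    degrees = [d for d in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)
               if d <= world_size and world_size % d == 0]
    feasible = [Strategy(p, g, o)
                for p, g, o in product(degrees, repeat=3)
                if memory_per_gpu(n_params, Strategy(p, g, o)) <= mem_budget]
    if not feasible:
        raise ValueError("no partitioning strategy fits the memory budget")
    return min(feasible, key=comm_proxy)


if __name__ == "__main__":
    # e.g. a 7B-parameter model on 1024 GPUs with an 80 GB per-GPU budget
    best = search_strategy(7e9, world_size=1024, mem_budget=80e9)
    print(best)
```

In this toy setting the search prefers to shard only the optimizer states (the largest contributor) as soon as that fits the budget, which mirrors the abstract's point that P, G, and OS need not share a single partitioning degree as they do in a fixed ZeRO stage.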