AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

Abstract

We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains performance equivalent to that of state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
