AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
July 7, 2025
Authors: Mark Lee, Tom Gunter, Chang Lan, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma, David Qiu, Vivek Rathod, Senyu Tong, Zhucheng Tu, Jianyu Wang, Yongqiang Wang, Zirui Wang, Floris Weers, Sam Wiseman, Guoli Yin, Bowen Zhang, Xiyou Zhou, Danyang Zhuo, Cheng Leong, Ruoming Pang
cs.AI
Abstract
We design and implement AXLearn, a production deep learning system that
facilitates scalable and high-performance training of large deep learning
models. Compared to other state-of-the-art deep learning systems, AXLearn has a
unique focus on modularity and support for heterogeneous hardware
infrastructure. AXLearn's internal interfaces between software components
follow strict encapsulation, allowing different components to be assembled to
facilitate rapid model development and experimentation on heterogeneous compute
infrastructure. We introduce a novel method of quantifying modularity via
Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains
constant complexity as we scale the components in the system, compared to
linear or quadratic complexity in other systems. This allows integrating
features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds
of modules with just 10 lines of code, compared to the hundreds of lines required in
other systems. At the same time, AXLearn maintains equivalent performance
compared to state-of-the-art training systems. Finally, we share our experience
in the development and operation of AXLearn.
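
The constant-complexity property described above can be illustrated with a toy config-composition pattern. This is a minimal sketch under assumed names (`AttentionConfig`, `LayerConfig`, `with_rope` are all hypothetical and not AXLearn's actual API); it shows how strict encapsulation lets one localized change propagate to every module built from a shared config, rather than requiring edits inside each module.

```python
# Hypothetical sketch of config-based module composition; names and
# structure are illustrative assumptions, not AXLearn's real interfaces.
from dataclasses import dataclass, field


@dataclass
class AttentionConfig:
    # The positional-embedding choice is an injectable sub-config, so
    # swapping it does not require touching the attention code itself.
    pos_emb: str = "learned"


@dataclass
class LayerConfig:
    attention: AttentionConfig = field(default_factory=AttentionConfig)


def with_rope(cfg: LayerConfig) -> LayerConfig:
    # A single mutation point upgrades every layer instantiated from this
    # config, illustrating why feature cost can stay constant as the number
    # of modules grows.
    cfg.attention.pos_emb = "rope"
    return cfg


cfg = with_rope(LayerConfig())
print(cfg.attention.pos_emb)  # -> rope
```

In a system without this encapsulation, enabling RoPE would instead mean editing each attention variant in place, which is the linear-to-quadratic growth in lines of code that the abstract contrasts against.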