AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

Abstract

We design and implement AXLearn, a production deep learning system that facilitates scalable, high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. The internal interfaces between AXLearn's software components follow strict encapsulation, so different components can be assembled for rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which shows that our system maintains constant complexity as the components in the system scale, compared to linear or quadratic complexity in other systems. As a result, features such as Rotary Position Embeddings (RoPE) can be integrated into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains performance equivalent to that of state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
