AXLearn: 이기종 인프라에서의 모듈형 대규모 모델 학습

초록

우리는 대규모 딥러닝 모델의 확장 가능하고 고성능인 학습을 용이하게 하는 프로덕션 딥러닝 시스템인 AXLearn을 설계하고 구현했습니다. 다른 최첨단 딥러닝 시스템과 비교하여, AXLearn은 모듈성과 이기종 하드웨어 인프라 지원에 독특한 초점을 맞추고 있습니다. AXLearn의 소프트웨어 구성 요소 간 내부 인터페이스는 엄격한 캡슐화를 따르며, 이기종 컴퓨팅 인프라에서 신속한 모델 개발과 실험을 가능하게 하기 위해 다양한 구성 요소를 조립할 수 있도록 합니다. 우리는 Lines-of-Code (LoC) 복잡성을 통해 모듈성을 정량화하는 새로운 방법을 소개하며, 이는 우리 시스템이 다른 시스템에서의 선형 또는 이차 복잡성과 달리 시스템 구성 요소를 확장함에 따라 일정한 복잡성을 유지하는 방식을 보여줍니다. 이를 통해 Rotary Position Embeddings (RoPE)와 같은 기능을 AXLearn에 통합할 때, 다른 시스템에서는 수백 줄의 코드가 필요한 반면, AXLearn에서는 단 10줄의 코드로 수백 개의 모듈에 걸쳐 이를 가능하게 합니다. 동시에, AXLearn은 최첨단 학습 시스템과 동등한 성능을 유지합니다. 마지막으로, AXLearn의 개발과 운영 과정에서 얻은 경험을 공유합니다.

English

We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundred of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.

AXLearn: 이기종 인프라에서의 모듈형 대규모 모델 학습

AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

초록

Support