NoLoCo: No-all-reduce Low Communication Training Method for Large Models

Abstract

Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators that communicate over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, which limits the size of the models that can be trained. Several recent studies have proposed training methods that are less communication-intensive and avoid the need for a highly connected compute cluster. These state-of-the-art low-communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer, in which each replica partially averages its weights with those of one randomly selected other replica. We provide both a theoretical convergence analysis of the proposed optimizer and empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low-communication training method DiLoCo. The synchronization step itself is estimated to be an order of magnitude faster than the all-reduce used in DiLoCo when a few hundred accelerators train over the internet. Our method also involves no global blocking communication, which reduces accelerator idle time. Compared to DiLoCo, we also observe up to 4% faster convergence across a wide range of model sizes and accelerator counts.
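The idea of synchronizing replicas through pairwise partial averaging, rather than an all-reduce, can be illustrated with a toy sketch. This is not the NoLoCo optimizer itself: it omits the gradient steps and the Nesterov momentum variant described in the paper, and the function names, the mixing factor `alpha`, and the choice of disjoint random pairs are assumptions made purely for illustration. It only shows that repeated exchanges between randomly chosen pairs pull the replicas toward a common value while each replica talks to a single peer per round.

```python
import random


def pairwise_partial_average(replicas, alpha, rng):
    """One round: shuffle replicas into disjoint pairs and move each member
    a fraction `alpha` of the way toward its partner's weights.
    (Hypothetical helper; the pairing scheme is an assumption.)"""
    order = list(range(len(replicas)))
    rng.shuffle(order)
    updated = [list(w) for w in replicas]
    # Consecutive entries of the shuffled order form the pairs.
    for i, j in zip(order[0::2], order[1::2]):
        for k in range(len(replicas[i])):
            wi, wj = replicas[i][k], replicas[j][k]
            updated[i][k] = wi + alpha * (wj - wi)
            updated[j][k] = wj + alpha * (wi - wj)
    return updated


def spread(replicas):
    """Largest per-coordinate gap between any two replicas."""
    return max(max(col) - min(col) for col in zip(*replicas))


def mean_weights(replicas):
    """Coordinate-wise mean over all replicas."""
    return [sum(col) / len(replicas) for col in zip(*replicas)]


rng = random.Random(0)
# Eight toy "replicas", each holding four weights that start out different.
replicas = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
initial_spread = spread(replicas)
initial_mean = mean_weights(replicas)

for _ in range(50):
    replicas = pairwise_partial_average(replicas, alpha=0.25, rng=rng)

final_spread = spread(replicas)
final_mean = mean_weights(replicas)
print("spread before:", initial_spread)
print("spread after: ", final_spread)
```

Because each exchange is symmetric, the mean over all replicas is preserved while the disagreement between replicas shrinks. In this toy setup no step requires a message that involves more than two participants.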

