Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Abstract

As machine learning models grow larger, the frequent synchronization inherent in data-parallel training causes significant slowdowns and poses a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, this prior work does not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
