

Optimized Network Architectures for Large Language Model Training with Billions of Parameters

July 22, 2023
Authors: Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
cs.AI

Abstract

This paper challenges the well-established paradigm for building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern where only small groups of GPUs require high-bandwidth any-to-any communication within them, to achieve near-optimal training performance. Across these groups of GPUs, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely resembles the communication requirement of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected with non-blocking any-to-any high-bandwidth interconnects that we call HB domains. Across the HB domains, the network only connects GPUs with communication demands. We call this network a "rail-only" connection, and show that our proposed architecture reduces the network cost by up to 75% compared to the state-of-the-art any-to-any Clos networks without compromising the performance of LLM training.
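The sketch below is a minimal illustration (not the authors' code) of the connectivity pattern the abstract describes: full any-to-any connectivity inside each HB domain, and, across HB domains, direct paths only between GPUs that sit on the same "rail". The function and parameter names (rail_only_links, any_to_any_links, num_domains, domain_size) are assumptions made for this example, and the pair counts stand in for connectivity only, not for the paper's actual switch- and transceiver-level cost model.

# Illustrative sketch (assumptions, not the paper's implementation):
# enumerate which GPU pairs have a direct network path in a rail-only
# design versus a full any-to-any cluster. GPUs are labeled
# (hb_domain, local_rank).

from itertools import combinations


def rail_only_links(num_domains: int, domain_size: int):
    """GPU pairs reachable directly under a rail-only design."""
    links = set()
    # Intra-domain: non-blocking any-to-any inside each HB domain.
    for d in range(num_domains):
        for a, b in combinations(range(domain_size), 2):
            links.add(((d, a), (d, b)))
    # Inter-domain: only same-local-rank GPUs (one "rail" per rank)
    # are connected across HB domains.
    for r in range(domain_size):
        for d1, d2 in combinations(range(num_domains), 2):
            links.add(((d1, r), (d2, r)))
    return links


def any_to_any_links(num_domains: int, domain_size: int):
    """Every GPU pair, as in a full any-to-any Clos fabric (for comparison)."""
    gpus = [(d, r) for d in range(num_domains) for r in range(domain_size)]
    return set(combinations(gpus, 2))


if __name__ == "__main__":
    rail = rail_only_links(num_domains=4, domain_size=8)
    full = any_to_any_links(num_domains=4, domain_size=8)
    print(f"rail-only direct pairs: {len(rail)}, any-to-any pairs: {len(full)}")

For 4 HB domains of 8 GPUs each, the sketch reports 160 directly connected pairs versus 496 in the fully connected case, which conveys (in simplified form) why dropping the unused cross-domain connectivity can cut network cost substantially, as the paper quantifies with its up-to-75% result.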