Optimized Network Architectures for Large Language Model Training with Billions of Parameters
July 22, 2023
Authors: Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
cs.AI
Abstract
This paper challenges the well-established paradigm for building any-to-any
networks for training Large Language Models (LLMs). We show that LLMs exhibit a
unique communication pattern where only small groups of GPUs require
high-bandwidth any-to-any communication within them to achieve near-optimal
training performance. Across these groups of GPUs, the communication is
insignificant, sparse, and homogeneous. We propose a new network architecture
that closely resembles the communication requirement of LLMs. Our architecture
partitions the cluster into sets of GPUs interconnected with non-blocking
any-to-any high-bandwidth interconnects that we call HB domains. Across the HB
domains, the network only connects GPUs with communication demands. We call
this network a "rail-only" connection, and show that our proposed architecture
reduces the network cost by up to 75% compared to the state-of-the-art
any-to-any Clos networks without compromising the performance of LLM training.
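
To make the topology idea concrete, below is a minimal Python sketch, not taken from the paper, that counts the logical cross-domain GPU pairs a full any-to-any fabric must support versus a rail-only design in which only GPUs with the same local rank (the same "rail") are connected across HB domains. The domain count, domain size, and function names are assumptions chosen for illustration; the paper's 75% figure refers to network equipment cost, not to this raw pair count.

```python
# Minimal sketch (assumed parameters, not the paper's code): compare the number
# of logical cross-domain GPU pairs in a full any-to-any fabric against a
# rail-only design, where only GPUs sharing a local rank ("rail") are linked
# across HB domains.

def any_to_any_cross_pairs(num_domains: int, domain_size: int) -> int:
    """GPU pairs that cross HB-domain boundaries in a full any-to-any fabric."""
    total = num_domains * domain_size
    all_pairs = total * (total - 1) // 2
    intra_pairs = num_domains * domain_size * (domain_size - 1) // 2
    return all_pairs - intra_pairs

def rail_only_cross_pairs(num_domains: int, domain_size: int) -> int:
    """GPU pairs connected across domains when only same-rank GPUs (rails) are linked."""
    return domain_size * num_domains * (num_domains - 1) // 2

if __name__ == "__main__":
    num_domains, domain_size = 16, 8  # assumed: 16 HB domains of 8 GPUs each
    full = any_to_any_cross_pairs(num_domains, domain_size)
    rail = rail_only_cross_pairs(num_domains, domain_size)
    print(f"any-to-any cross-domain pairs: {full}")
    print(f"rail-only cross-domain pairs:  {rail}")
    print(f"connectivity reduction: {1 - rail / full:.1%}")
```

With the assumed sizes, the rail-only layout retains only a small fraction of the cross-domain connectivity, which is what lets the design drop most inter-domain switching capacity while the high-bandwidth any-to-any traffic stays inside each HB domain.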