Optimized Network Architectures for Large Language Model Training with Billions of Parameters

Abstract

This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern in which only small groups of GPUs require high-bandwidth any-to-any communication among themselves to achieve near-optimal training performance. Across these groups of GPUs, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected with non-blocking any-to-any high-bandwidth interconnects, which we call HB domains. Across the HB domains, the network connects only GPUs with communication demands. We call this a "rail-only" connection, and show that the proposed architecture reduces network cost by up to 75% compared to state-of-the-art any-to-any Clos networks without compromising LLM training performance.
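The connectivity described in the abstract can be illustrated with a small sketch. It assumes that the GPUs with communication demands across HB domains are those sharing the same local rank inside their respective domains (one "rail" per local rank); the abstract itself does not spell this out. The function name `rail_only_links`, the `(domain, local_rank)` GPU labels, and the example sizes are hypothetical. The sketch counts directly connected GPU pairs only; it does not model switches, bandwidth, or cost, and says nothing about the 75% figure.

```python
from itertools import combinations


def rail_only_links(num_domains: int, gpus_per_domain: int):
    """Enumerate directly connected GPU pairs in a rail-only layout.

    A GPU is labeled (domain, local_rank). Assumption for illustration:
    across HB domains, only GPUs with the same local rank are connected.
    """
    hb_links = set()    # pairs inside one HB domain (any-to-any)
    rail_links = set()  # pairs across HB domains (same local rank only)

    # Inside each HB domain every GPU pair is connected.
    for d in range(num_domains):
        domain_gpus = [(d, r) for r in range(gpus_per_domain)]
        hb_links.update(combinations(domain_gpus, 2))

    # Across HB domains, each rail joins the GPUs that share a local rank.
    for r in range(gpus_per_domain):
        rail_gpus = [(d, r) for d in range(num_domains)]
        rail_links.update(combinations(rail_gpus, 2))

    return hb_links, rail_links


if __name__ == "__main__":
    domains, per_domain = 4, 8  # hypothetical cluster: 4 HB domains of 8 GPUs
    hb, rail = rail_only_links(domains, per_domain)
    total_gpus = domains * per_domain
    all_pairs = total_gpus * (total_gpus - 1) // 2
    cross_domain_pairs = all_pairs - len(hb)
    print(f"intra-domain pairs: {len(hb)}")
    print(f"rail pairs: {len(rail)} of {cross_domain_pairs} cross-domain pairs")
```

In this toy configuration only a small fraction of the cross-domain GPU pairs are wired together, which is the structural idea behind the rail-only design; GPU pairs that are not on the same rail have no direct network path in this sketch.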


