ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Abstract

Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
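To make concrete why each SP rank needs only an "initial operator state", the following is a minimal single-process sketch in pure Python. It is not the paper's All-Scan primitive or its implementation: it assumes plain, unnormalized causal linear attention (state update S_t = S_{t-1} + k_t v_t^T, output o_t = q_t^T S_t) with no decay or gating, and it simulates the ranks one after another in a loop. The function names `causal_linear_attention` and `simulate_sequence_parallel` are hypothetical and chosen for illustration only.

```python
import random


def causal_linear_attention(q, k, v, state=None):
    """Unnormalized causal linear attention over one chunk of tokens.

    q, k: lists of d_k-dimensional vectors; v: list of d_v-dimensional vectors.
    state: d_k x d_v matrix summarizing all earlier tokens (zeros if None).
    Returns (outputs, final_state).
    """
    d_k, d_v = len(k[0]), len(v[0])
    # Copy the incoming state so the caller's matrix is not mutated.
    if state is None:
        s = [[0.0] * d_v for _ in range(d_k)]
    else:
        s = [row[:] for row in state]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # State update: S_t = S_{t-1} + k_t v_t^T
        for i in range(d_k):
            for j in range(d_v):
                s[i][j] += k_t[i] * v_t[j]
        # Output: o_t = q_t^T S_t
        outputs.append(
            [sum(q_t[i] * s[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, s


def simulate_sequence_parallel(q, k, v, num_ranks):
    """Split the sequence into contiguous chunks, one per simulated rank."""
    n = len(q)
    assert n % num_ranks == 0, "sequence length must divide evenly"
    chunk = n // num_ranks
    outputs, state = [], None
    for rank in range(num_ranks):
        lo, hi = rank * chunk, (rank + 1) * chunk
        # `state` is the only data handed from one rank to the next; its
        # size is d_k x d_v regardless of how long each chunk is.
        out, state = causal_linear_attention(q[lo:hi], k[lo:hi], v[lo:hi], state)
        outputs.extend(out)
    return outputs


def random_vectors(n, d):
    return [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
n, d_k, d_v = 16, 4, 3
q, k, v = random_vectors(n, d_k), random_vectors(n, d_k), random_vectors(n, d_v)

reference, _ = causal_linear_attention(q, k, v)                # one device
parallel = simulate_sequence_parallel(q, k, v, num_ranks=4)    # four simulated ranks

max_err = max(
    abs(a - b)
    for ref_row, par_row in zip(reference, parallel)
    for a, b in zip(ref_row, par_row)
)

# Per-device length for the abstract's example, assuming "1M" means 2**20 tokens.
tokens_per_device = (1 << 20) // 64
```

In this sketch the chunked result matches the single-device result, and the data passed between ranks is a single d_k x d_v matrix whose size does not depend on the sequence length. The sequential loop over ranks does not model how the state is actually scheduled or overlapped with computation across devices, which the abstract does not describe. Under the stated assumption that "1M" means 2^20 tokens, the 64-device example corresponds to 16,384 (16k) tokens per device, which is consistent with the comparison to a 16k sequence on a single device.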
