ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
July 1, 2025
Authors: Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma
cs.AI
Abstract
Linear attention mechanisms deliver significant advantages for Large Language
Models (LLMs) by providing linear computational complexity, enabling efficient
processing of ultra-long sequences (e.g., 1M context). However, existing
Sequence Parallelism (SP) methods, essential for distributing these workloads
across devices, become the primary bottleneck due to substantial communication
overhead. In this paper, we introduce ZeCO (Zero Communication Overhead)
sequence parallelism for linear attention models, a new SP method designed to
overcome these limitations and achieve end-to-end near-linear scalability for
long sequence training. For example, training a model with a 1M sequence length
across 64 devices using ZeCO takes roughly the same time as training with a
16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new
collective communication primitive. All-Scan provides each SP rank with
precisely the initial operator state it requires while maintaining a minimal
communication footprint, effectively eliminating communication overhead.
Theoretically, we prove the optimality of ZeCO, showing that it introduces only
negligible time and space overhead. Empirically, we compare the communication
costs of different sequence parallelism strategies and demonstrate that
All-Scan achieves the fastest communication in SP scenarios. Specifically, on
256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to
the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a
clear path toward efficiently training next-generation LLMs on previously
intractable sequence lengths.
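
The abstract describes All-Scan only at a high level: each SP rank must receive the prefix operator state accumulated over all preceding ranks before it can process its local chunk. The NumPy sketch below is an illustrative reading of that idea, not the paper's implementation: it assumes a simple non-gated linear-attention recurrence (S_t = S_{t-1} + k_t^T v_t, o_t = q_t S_t) and substitutes a serial `exclusive_scan` helper for the actual All-Scan collective, which in practice is a pipelined cross-device communication primitive.

```python
import numpy as np

# Illustrative (non-gated) linear-attention recurrence:
#   S_t = S_{t-1} + k_t^T v_t,   o_t = q_t S_t
# Under sequence parallelism, rank i owns chunk i of the sequence and needs
# the prefix state over all earlier chunks before it can emit correct outputs.
# An exclusive prefix scan of per-chunk states over the SP ranks supplies
# exactly that initial state to every rank.

def chunk_state(K, V):
    """State contribution of one chunk: sum_t k_t^T v_t (shape d_k x d_v)."""
    return K.T @ V

def chunk_outputs(Q, K, V, S_init):
    """Run the recurrence within one chunk, starting from S_init."""
    S = S_init.copy()
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S += np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def exclusive_scan(states):
    """Exclusive prefix sum over per-rank states.

    Stands in for the All-Scan collective: rank i receives the sum of states
    from ranks 0..i-1 (rank 0 receives zeros). The real primitive would
    pipeline this across devices rather than loop serially; this is only a
    functional reference.
    """
    prefix = np.zeros_like(states[0])
    out = []
    for s in states:
        out.append(prefix.copy())
        prefix = prefix + s
    return out

rng = np.random.default_rng(0)
world_size, chunk_len, d_k, d_v = 4, 8, 16, 16
Q = rng.standard_normal((world_size, chunk_len, d_k))
K = rng.standard_normal((world_size, chunk_len, d_k))
V = rng.standard_normal((world_size, chunk_len, d_v))

# 1) Each rank computes its local chunk state independently.
states = [chunk_state(K[r], V[r]) for r in range(world_size)]
# 2) The scan distributes the correct initial state to every rank.
init_states = exclusive_scan(states)
# 3) Each rank finishes its chunk locally, with no further communication.
sp_out = np.concatenate(
    [chunk_outputs(Q[r], K[r], V[r], init_states[r]) for r in range(world_size)]
)

# Reference: the same recurrence over the full, unsplit sequence.
ref_out = chunk_outputs(
    Q.reshape(-1, d_k), K.reshape(-1, d_k), V.reshape(-1, d_v),
    np.zeros((d_k, d_v)),
)
assert np.allclose(sp_out, ref_out)
print("Sequence-parallel outputs match the sequential reference.")
```

The check at the end confirms why a prefix scan suffices: the chunk states combine associatively, so giving each rank its exclusive prefix reproduces the fully sequential recurrence. How the scan itself is overlapped with computation and kept communication-minimal is the contribution claimed for All-Scan in the paper.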