BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
March 14, 2024
作者: Sun Ao, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun, Shengnan Wang, Teng Su
cs.AI
Abstract
Effective attention modules have played a crucial role in the success of
Transformer-based large language models (LLMs), but the quadratic time and
memory complexities of these attention modules also pose a challenge when
processing long sequences. One potential solution for the long sequence problem
is to utilize distributed clusters to parallelize the computation of attention
modules across multiple devices (e.g., GPUs). However, adopting a distributed
approach inevitably introduces extra memory overheads to store local attention
results and incurs additional communication costs to aggregate local results
into global ones. In this paper, we propose a distributed attention framework
named "BurstAttention" to optimize memory access and communication operations
at both the global cluster and local device levels. In our experiments, we
compare BurstAttention with other competitive distributed attention solutions
for long sequence processing. The experimental results under different length
settings demonstrate that BurstAttention offers significant advantages for
processing long sequences compared with these competitive baselines, reducing
communication overheads by 40% and achieving a 2× speedup when training on a
32K sequence length with 8× A100 GPUs.
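
To make the aggregation step concrete, the sketch below shows one standard way that per-device attention results over a sharded key/value sequence can be merged into the global result using online-softmax rescaling, so that no device ever materializes the full attention matrix. This is a minimal single-process NumPy illustration, not BurstAttention's actual distributed GPU implementation; the function names (`local_attention`, `merge`) and shapes are illustrative assumptions.

```python
# Minimal sketch: merging per-shard attention outputs via online softmax.
import numpy as np

def local_attention(q, k_shard, v_shard):
    """Attention of queries q against one K/V shard.
    Returns the unnormalized partial output plus the row-wise max and
    sum of exponentials needed to merge shards later."""
    d = q.shape[-1]
    scores = q @ k_shard.T / np.sqrt(d)           # (n_q, n_kv_local)
    row_max = scores.max(axis=-1, keepdims=True)  # per-row max for stability
    exp_scores = np.exp(scores - row_max)
    row_sum = exp_scores.sum(axis=-1, keepdims=True)
    partial_out = exp_scores @ v_shard            # unnormalized partial output
    return partial_out, row_max, row_sum

def merge(acc, new):
    """Combine two partial results with online-softmax rescaling."""
    out_a, max_a, sum_a = acc
    out_b, max_b, sum_b = new
    new_max = np.maximum(max_a, max_b)
    scale_a = np.exp(max_a - new_max)
    scale_b = np.exp(max_b - new_max)
    return (out_a * scale_a + out_b * scale_b,
            new_max,
            sum_a * scale_a + sum_b * scale_b)

rng = np.random.default_rng(0)
n_q, n_kv, d, n_shards = 16, 64, 32, 4
q = rng.standard_normal((n_q, d))
k = rng.standard_normal((n_kv, d))
v = rng.standard_normal((n_kv, d))

# Simulate sharding K/V across "devices" and aggregating the local results.
acc = None
for k_s, v_s in zip(np.split(k, n_shards), np.split(v, n_shards)):
    part = local_attention(q, k_s, v_s)
    acc = part if acc is None else merge(acc, part)
out = acc[0] / acc[2]  # normalize by the global softmax denominator

# Reference: full (quadratic) attention for comparison.
scores = q @ k.T / np.sqrt(d)
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = probs / probs.sum(axis=-1, keepdims=True) @ v
assert np.allclose(out, ref, atol=1e-6)
```

In a real distributed setting the `merge` step corresponds to communication between devices, and the small per-shard statistics (row max and sum) are what must be exchanged instead of full attention matrices, which is the kind of memory and communication cost the framework aims to minimize.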