BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
March 14, 2024
作者: Sun Ao, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun, Shengnan Wang, Teng Su
cs.AI
Abstract
Effective attention modules have played a crucial role in the success of
Transformer-based large language models (LLMs), but the quadratic time and
memory complexities of these attention modules also pose a challenge when
processing long sequences. One potential solution for the long sequence problem
is to utilize distributed clusters to parallelize the computation of attention
modules across multiple devices (e.g., GPUs). However, adopting a distributed
approach inevitably introduces extra memory overheads to store local attention
results and incurs additional communication costs to aggregate local results
into global ones. In this paper, we propose a distributed attention framework
named "BurstAttention" to optimize memory access and communication operations
at both the global cluster and local device levels. In our experiments, we
compare BurstAttention with other competitive distributed attention solutions
for long sequence processing. The experimental results under different length
settings demonstrate that BurstAttention offers significant advantages for
processing long sequences compared with these competitive baselines, reducing
communication overheads by 40% and achieving a 2× speedup when training on a
32K sequence length with 8× A100 GPUs.
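
To make the aggregation step concrete, the sketch below shows one standard way that per-device attention results over a sharded key/value sequence can be merged into the global result using online-softmax rescaling, so that no device ever materializes the full attention matrix. This is a minimal single-process NumPy illustration, not BurstAttention's actual distributed GPU implementation; the function names (`local_attention`, `merge`) and shapes are illustrative assumptions.

```python
# Minimal sketch: merging per-shard attention outputs via online softmax.
import numpy as np

def local_attention(q, k_shard, v_shard):
    """Attention of queries q against one K/V shard.
    Returns the unnormalized partial output plus the row-wise max and
    sum of exponentials needed to merge shards later."""
    d = q.shape[-1]
    scores = q @ k_shard.T / np.sqrt(d)           # (n_q, n_kv_local)
    row_max = scores.max(axis=-1, keepdims=True)  # per-row max for stability
    exp_scores = np.exp(scores - row_max)
    row_sum = exp_scores.sum(axis=-1, keepdims=True)
    partial_out = exp_scores @ v_shard            # unnormalized partial output
    return partial_out, row_max, row_sum

def merge(acc, new):
    """Combine two partial results with online-softmax rescaling."""
    out_a, max_a, sum_a = acc
    out_b, max_b, sum_b = new
    new_max = np.maximum(max_a, max_b)
    scale_a = np.exp(max_a - new_max)
    scale_b = np.exp(max_b - new_max)
    return (out_a * scale_a + out_b * scale_b,
            new_max,
            sum_a * scale_a + sum_b * scale_b)

rng = np.random.default_rng(0)
n_q, n_kv, d, n_shards = 16, 64, 32, 4
q = rng.standard_normal((n_q, d))
k = rng.standard_normal((n_kv, d))
v = rng.standard_normal((n_kv, d))

# Simulate sharding K/V across "devices" and aggregating the local results.
acc = None
for k_s, v_s in zip(np.split(k, n_shards), np.split(v, n_shards)):
    part = local_attention(q, k_s, v_s)
    acc = part if acc is None else merge(acc, part)
out = acc[0] / acc[2]  # normalize by the global softmax denominator

# Reference: full (quadratic) attention for comparison.
scores = q @ k.T / np.sqrt(d)
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = probs / probs.sum(axis=-1, keepdims=True) @ v
assert np.allclose(out, ref, atol=1e-6)
```

In a real distributed setting the `merge` step corresponds to communication between devices, and the small per-shard statistics (row max and sum) are what must be exchanged instead of full attention matrices, which is the kind of memory and communication cost the framework aims to minimize.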