
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

March 14, 2024
Authors: Sun Ao, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun, Shengnan Wang, Teng Su
cs.AI

Abstract

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution to the long-sequence problem is to use distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overhead to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long-sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 2× speedup when training on 32K-length sequences with 8 A100 GPUs.
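The central difficulty the abstract points to is that when attention is parallelized across devices, each device produces only a partial result for its local slice of the keys and values, and these partial results must be merged into the exact global attention output. The sketch below illustrates one standard way such local-to-global aggregation can be done exactly, using an online-softmax (log-sum-exp rescaling) merge; it is a minimal single-process illustration with devices simulated by a Python loop, not the BurstAttention implementation, and the block sizes and function names are assumptions for the example.

```python
# Minimal sketch: each simulated "device" holds one block of K/V, computes a
# local, unnormalized attention result for the query block, and the partial
# results are merged with an online-softmax correction so that the final
# output equals standard softmax attention. Illustrative only.
import torch

def local_attention(q, k_blk, v_blk):
    """Attention of q against one local K/V block.

    Returns the unnormalized output, the per-row score max, and the
    per-row softmax denominator for that block.
    """
    scores = q @ k_blk.transpose(-1, -2) / q.shape[-1] ** 0.5   # (Tq, Tkv_blk)
    m = scores.max(dim=-1, keepdim=True).values                  # running row max
    p = torch.exp(scores - m)                                     # stabilized exp
    return p @ v_blk, m, p.sum(dim=-1, keepdim=True)

def aggregate(partials):
    """Merge per-block partial results into the global attention output."""
    out, m, l = partials[0]
    for out_i, m_i, l_i in partials[1:]:
        m_new = torch.maximum(m, m_i)
        # Rescale both the running and incoming partials to the new row max.
        out = out * torch.exp(m - m_new) + out_i * torch.exp(m_i - m_new)
        l = l * torch.exp(m - m_new) + l_i * torch.exp(m_i - m_new)
        m = m_new
    return out / l

torch.manual_seed(0)
T, d, n_devices = 64, 32, 4
q, k, v = (torch.randn(T, d) for _ in range(3))

# Each simulated device sees one contiguous K/V block of the long sequence.
partials = [local_attention(q, k_blk, v_blk)
            for k_blk, v_blk in zip(k.chunk(n_devices), v.chunk(n_devices))]
merged = aggregate(partials)

reference = torch.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(merged, reference, atol=1e-4))  # expected: True
```

Because each partial result carries its own running maximum and softmax denominator, the merged output matches single-device softmax attention exactly, and no device ever materializes the full score matrix; the extra per-block statistics and rescaling steps are the kind of memory and communication overhead the paper sets out to reduce.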

