Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
October 19, 2025
Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu
cs.AI
Abstract
Transformer-based large language models (LLMs) have achieved remarkable
success, yet their standard attention mechanism incurs quadratic computation
and memory costs with respect to sequence length, posing a major bottleneck for
long-context training. Prior work tackles this challenge along two directions:
(1) kernel-level optimizations, which accelerate dense and sparse attention
operators; and (2) module-level strategies, often referred to as distributed
attention or context parallel training, which scale attention across multiple
devices. However, systematic evaluation remains limited: operator-level
comparisons are often incomplete, while context parallel strategies are
typically framework-specific and lack clear performance analysis across
scenarios. To address these gaps, we propose a unified benchmark that integrates
representative attention kernels and context parallel mechanisms with a modular
and extensible interface for evaluation. The benchmark evaluates methods along
two critical dimensions: (1) attention mask patterns, which strongly affect
efficiency, scalability, and usability, and (2) sequence length and distributed
scale, which determine performance under extreme long-context training. Through
comprehensive experiments on a cluster of up to 96 GPUs, our benchmark
enables reproducible comparisons, highlights method-specific trade-offs, and
provides practical guidance for designing and deploying attention mechanisms in
long-context LLM training.
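
The abstract refers to a modular, extensible interface for integrating attention kernels, and to the quadratic cost of standard attention in sequence length. The sketch below is a minimal illustration of both ideas under assumed names, not the benchmark's actual code: a hypothetical registry (ATTENTION_REGISTRY, register, run_benchmark) times a naive dense kernel with a causal mask across sequence lengths, using NumPy on CPU so the example stays self-contained.

```python
# Hypothetical sketch of a registry-style benchmark interface; names and
# structure are illustrative only, not the paper's actual API.
import time
import numpy as np

ATTENTION_REGISTRY = {}

def register(name):
    """Register an attention implementation under a common signature."""
    def wrapper(fn):
        ATTENTION_REGISTRY[name] = fn
        return fn
    return wrapper

@register("dense")
def dense_attention(q, k, v, mask=None):
    # Standard softmax attention: O(n^2) time and memory in sequence length n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def causal_mask(n):
    # Lower-triangular mask: token i attends only to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def run_benchmark(seq_lens, head_dim=64, mask_fn=causal_mask):
    rng = np.random.default_rng(0)
    for n in seq_lens:
        q, k, v = (rng.standard_normal((n, head_dim)) for _ in range(3))
        mask = mask_fn(n)
        for name, fn in ATTENTION_REGISTRY.items():
            start = time.perf_counter()
            fn(q, k, v, mask)
            elapsed = time.perf_counter() - start
            print(f"{name:>8} | seq_len={n:>6} | {elapsed * 1e3:8.2f} ms")

if __name__ == "__main__":
    # Doubling the sequence length roughly quadruples dense-attention cost.
    run_benchmark([1024, 2048, 4096])
```

In the same spirit, a sparse kernel or a context-parallel wrapper could be registered under the same signature, which is the sense in which mask patterns and sequence length (or distributed scale) serve as the benchmark's two comparison axes.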