Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
October 19, 2025
Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu
cs.AI
Abstract
Transformer-based large language models (LLMs) have achieved remarkable
success, yet their standard attention mechanism incurs quadratic computation
and memory costs with respect to sequence length, posing a major bottleneck for
long-context training. Prior work tackles this challenge along two directions:
(1) kernel-level optimizations, which accelerate dense and sparse attention
operators; and (2) module-level strategies, often referred to as distributed
attention or context parallel training, which scale attention across multiple
devices. However, systematic evaluation remains limited: operator-level
comparisons are often incomplete, while context parallel strategies are
typically framework-specific, with unclear performance analysis across
contexts. To address these gaps, we propose a unified benchmark that integrates
representative attention kernels and context parallel mechanisms with a modular
and extensible interface for evaluation. The benchmark evaluates methods along
two critical dimensions: (1) attention mask patterns, which strongly affect
efficiency, scalability, and usability, and (2) sequence length and distributed
scale, which determine performance under extreme long-context training. Through
comprehensive experiments on a cluster of up to 96 GPUs, our benchmark
enables reproducible comparisons, highlights method-specific trade-offs, and
provides practical guidance for designing and deploying attention mechanisms in
long-context LLM training.
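
To make the idea of a "modular and extensible interface" concrete, the sketch below shows one way such a benchmark could be organized: attention implementations are registered under a common callable signature and then timed across the two evaluation dimensions the abstract names (mask pattern and sequence length). This is a minimal illustrative sketch, not the paper's actual API; the names `ATTN_REGISTRY`, `register`, `BenchConfig`, and `benchmark` are assumptions, and only a single-device PyTorch baseline kernel is shown (a context-parallel mechanism would additionally wrap a `torch.distributed` process group).

```python
# Minimal sketch of a modular attention-benchmark interface (hypothetical; not the paper's API).
import time
from dataclasses import dataclass
from typing import Callable, Dict

import torch
import torch.nn.functional as F


@dataclass
class BenchConfig:
    batch: int = 1
    heads: int = 8
    head_dim: int = 64
    seq_len: int = 4096
    causal: bool = True  # stand-in for the "attention mask pattern" dimension
    device: str = "cuda" if torch.cuda.is_available() else "cpu"


# Registry of attention implementations; real kernels (e.g. FlashAttention)
# or context-parallel wrappers would be registered under the same signature.
ATTN_REGISTRY: Dict[str, Callable[..., torch.Tensor]] = {}


def register(name: str):
    def wrap(fn):
        ATTN_REGISTRY[name] = fn
        return fn
    return wrap


@register("sdpa")
def sdpa_attention(q, k, v, causal: bool) -> torch.Tensor:
    # PyTorch's fused scaled-dot-product attention as a baseline "kernel".
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


def benchmark(name: str, cfg: BenchConfig, iters: int = 10) -> float:
    """Return average seconds per forward pass for a registered implementation."""
    fn = ATTN_REGISTRY[name]
    shape = (cfg.batch, cfg.heads, cfg.seq_len, cfg.head_dim)
    dtype = torch.float16 if cfg.device == "cuda" else torch.float32
    q, k, v = (torch.randn(shape, device=cfg.device, dtype=dtype) for _ in range(3))
    # Warm-up, then timed runs.
    fn(q, k, v, cfg.causal)
    if cfg.device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v, cfg.causal)
    if cfg.device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters


if __name__ == "__main__":
    # Sweep the sequence-length dimension for one mask pattern.
    for seq_len in (1024, 4096, 16384):
        cfg = BenchConfig(seq_len=seq_len)
        ms = benchmark("sdpa", cfg) * 1e3
        print(f"sdpa  seq_len={seq_len:6d}  causal={cfg.causal}  {ms:.2f} ms/iter")
```

Under this kind of registry design, adding a new dense or sparse kernel, or a distributed attention strategy, only requires registering another callable, which is what makes side-by-side, reproducible comparisons across mask patterns and scales tractable.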