The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
April 24, 2025
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
cs.AI
Abstract
Sparse attention offers a promising strategy to extend long-context
capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy
trade-offs, and systematic scaling studies remain unexplored. To address this
gap, we perform a careful comparison of training-free sparse attention methods
at varying model scales, sequence lengths, and sparsity levels on a diverse
collection of long-sequence tasks, including novel ones that rely on natural
language while remaining controllable and easy to evaluate. Based on our
experiments, we report a series of key findings: 1) an isoFLOPS analysis
reveals that for very long sequences, larger and highly sparse models are
preferable to smaller and dense ones. 2) The level of sparsity attainable while
statistically guaranteeing accuracy preservation is higher during decoding than
prefilling, and correlates with model size in the former. 3) There is no clear
strategy that performs best across tasks and phases, with different units of
sparsification or budget adaptivity needed for different scenarios. Even
moderate sparsity levels often result in significant performance degradation on
at least one task, highlighting that sparse attention is not a universal
solution. 4) We introduce and validate novel scaling laws specifically tailored
for sparse attention, providing evidence that our findings are likely to hold
true beyond our range of experiments. Through these insights, we demonstrate
that sparse attention is a key tool to enhance the capabilities of Transformer
LLMs for processing longer sequences, but requires careful evaluation of
trade-offs for performance-sensitive applications.
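To make the abstract's notion of training-free sparse attention concrete, below is a minimal sketch of one common strategy: per-query top-k key selection at decode time. This is an illustrative example only, not one of the methods evaluated in the paper; the function names and the fixed budget `k` are assumptions for illustration.

```python
# Minimal, illustrative sketch of training-free sparse attention via
# per-query top-k key selection at decode time (NumPy). Not the paper's
# implementation; `k` is an assumed fixed budget.
import numpy as np

def dense_attention(q, K, V):
    # q: (d,), K: (n, d), V: (n, d_v) -> attends over all n cached keys.
    scores = K @ q / np.sqrt(q.shape[-1])           # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                               # (d_v,)

def topk_sparse_attention(q, K, V, k):
    # Keep only the k highest-scoring keys; the rest are skipped entirely.
    scores = K @ q / np.sqrt(q.shape[-1])           # (n,)
    idx = np.argpartition(scores, -k)[-k:]           # indices of top-k keys
    sub_scores = scores[idx]
    weights = np.exp(sub_scores - sub_scores.max())
    weights /= weights.sum()
    return weights @ V[idx]                          # (d_v,)

# Usage: with n = 4096 cached keys and a budget of k = 256, the sparse path
# reads only ~6% of the value cache when forming the output.
rng = np.random.default_rng(0)
n, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out_dense = dense_attention(q, K, V)
out_sparse = topk_sparse_attention(q, K, V, k=256)
```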
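The isoFLOPS result in finding 1 can also be made intuitive with rough, back-of-envelope FLOP accounting. The sketch below uses standard simplified per-token estimates (about 2 FLOPs per parameter, plus attention FLOPs proportional to context length and attention density) and ignores prefill; the model sizes, layer counts, and widths are hypothetical and not taken from the paper.

```python
# Rough, assumed isoFLOPS sketch (simplified decode-time accounting, not the
# paper's exact analysis). Per generated token:
#   - parameter FLOPs ~ 2 * n_params
#   - attention FLOPs ~ 4 * n_layers * d_model * context_len * density
# where `density` is the fraction of key/value positions actually attended.
def flops_per_token(n_params, n_layers, d_model, context_len, density=1.0):
    param_flops = 2 * n_params
    attn_flops = 4 * n_layers * d_model * context_len * density
    return param_flops + attn_flops

# Example with hypothetical layer/width numbers: at a 1M-token context, a 7B
# model attending to ~10% of positions comes in well under the per-token FLOP
# budget of a dense 3B model, illustrating why larger, highly sparse models
# can be preferable at very long sequences.
small_dense = flops_per_token(3e9, 28, 3072, 1_000_000, density=1.0)
large_sparse = flops_per_token(7e9, 32, 4096, 1_000_000, density=0.1)
print(f"dense 3B : {small_dense:.3e} FLOPs/token")
print(f"sparse 7B: {large_sparse:.3e} FLOPs/token")
```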