Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
October 2, 2025
Author: Adam Filipek
cs.AI
Abstract
The Transformer architecture, underpinned by the Multi-Head Attention (MHA)
mechanism, has become the de facto standard for state-of-the-art models in
artificial intelligence. However, the quadratic computational complexity of MHA
with respect to sequence length presents a significant barrier to scaling,
particularly for applications involving long contexts. Prevailing solutions,
such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have
effectively addressed the memory bandwidth bottleneck that dominates
autoregressive inference latency by sharing Key and Value projections. While
highly successful, these methods do not reduce the fundamental number of
floating-point operations (FLOPs) required for the attention score computation,
which remains a critical bottleneck for training and full-sequence processing.
This paper introduces Sparse Query Attention (SQA), a novel attention
architecture that pursues an alternative and complementary optimization path.
Instead of reducing Key/Value heads, SQA reduces the number of Query heads.
This architectural modification directly decreases the computational complexity
of the attention mechanism by a factor proportional to the reduction in query
heads, thereby lowering the overall FLOPs. This work presents the theoretical
foundation of SQA, its mathematical formulation, and a family of architectural
variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate
that SQA can achieve significant throughput improvements of up to 3x in
computation-bound scenarios such as model pre-training, fine-tuning, and
encoder-based tasks, with only a minimal impact on model quality in preliminary
small-scale experiments. SQA was discovered serendipitously during the
development of the upcoming Reactive Transformer architecture, suggesting its
potential as a powerful tool for building more efficient and scalable models.
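To make the core idea concrete, the following is a minimal sketch of an SQA-style layer in PyTorch, assuming a GQA-like sharing of Key/Value heads across the reduced set of Query heads; the class name, hyperparameters, and this particular sharing scheme are illustrative assumptions, not the paper's reference implementation or any of its specific variants.

```python
# Minimal sketch of a Sparse Query Attention (SQA)-style layer (assumed design,
# not the paper's reference code). Reducing query heads from H to H_q shrinks
# the score matrix Q K^T from (H, L, L) to (H_q, L, L), cutting attention FLOPs
# by roughly a factor of H / H_q.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0, "query heads must be divisible by KV heads"
        self.d_head = d_model // n_heads          # per-head size of the full H-head model
        self.n_q_heads = n_q_heads                # H_q < H: the source of the FLOP savings
        self.n_kv_heads = n_kv_heads              # KV heads shared across query-head groups (assumption)
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, _ = x.shape
        q = self.q_proj(x).view(B, L, self.n_q_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, L, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, L, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat KV heads so each group of query heads shares one KV head.
        repeat = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)   # (B, H_q, L, d_head)
        out = out.transpose(1, 2).reshape(B, L, self.n_q_heads * self.d_head)
        return self.o_proj(out)


# Example with illustrative sizes: 8 full heads reduced to 2 query heads,
# i.e. roughly 4x fewer attention-score FLOPs than standard MHA.
layer = SparseQueryAttention(d_model=512, n_heads=8, n_q_heads=2, n_kv_heads=2)
y = layer(torch.randn(1, 1024, 512))
print(y.shape)  # torch.Size([1, 1024, 512])
```

Note the contrast with MQA/GQA: those keep all H query heads and shrink only the KV projections (saving KV-cache bandwidth at inference), whereas the sketch above shrinks the query side itself, which is what reduces the quadratic score computation during training and full-sequence processing.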