Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
October 2, 2025
Author: Adam Filipek
cs.AI
Abstract
The Transformer architecture, underpinned by the Multi-Head Attention (MHA)
mechanism, has become the de facto standard for state-of-the-art models in
artificial intelligence. However, the quadratic computational complexity of MHA
with respect to sequence length presents a significant barrier to scaling,
particularly for applications involving long contexts. Prevailing solutions,
such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have
effectively addressed the memory bandwidth bottleneck that dominates
autoregressive inference latency by sharing Key and Value projections. While
highly successful, these methods do not reduce the fundamental number of
floating-point operations (FLOPs) required for the attention score computation,
which remains a critical bottleneck for training and full-sequence processing.
This paper introduces Sparse Query Attention (SQA), a novel attention
architecture that pursues an alternative and complementary optimization path.
Instead of reducing Key/Value heads, SQA reduces the number of Query heads.
This architectural modification directly decreases the computational complexity
of the attention mechanism by a factor proportional to the reduction in query
heads, thereby lowering the overall FLOPs. This work presents the theoretical
foundation of SQA, its mathematical formulation, and a family of architectural
variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate
that SQA can achieve significant throughput improvements of up to 3x in
computation-bound scenarios such as model pre-training, fine-tuning, and
encoder-based tasks, with only a minimal impact on model quality in preliminary
small-scale experiments. SQA was discovered serendipitously during the
development of the upcoming Reactive Transformer architecture, suggesting its
potential as a powerful tool for building more efficient and scalable models.
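To make the core idea concrete, the following is a minimal sketch of an SQA-style layer in PyTorch. It is not the authors' reference implementation; the class name SparseQueryAttention, the argument names (n_q_heads, n_kv_heads), and the simplifying assumption that n_kv_heads <= n_q_heads (with GQA-style key/value sharing) are illustrative choices. The sketch only demonstrates the mechanism described above: projecting to fewer query heads shrinks the attention-score computation in direct proportion to the query-head reduction.

    # Minimal SQA-style layer sketch (illustrative, not the paper's code).
    import torch
    import torch.nn.functional as F
    from torch import nn

    class SparseQueryAttention(nn.Module):
        def __init__(self, d_model: int, n_heads: int, n_q_heads: int, n_kv_heads: int):
            super().__init__()
            assert d_model % n_heads == 0
            assert n_q_heads <= n_heads and n_kv_heads <= n_q_heads
            assert n_q_heads % n_kv_heads == 0
            self.head_dim = d_model // n_heads
            self.n_q_heads = n_q_heads
            self.n_kv_heads = n_kv_heads
            # Fewer query heads than standard MHA: score/softmax/value FLOPs
            # scale with n_q_heads * L^2 * head_dim, so halving n_q_heads
            # roughly halves that part of the computation.
            self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
            self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
            self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
            self.o_proj = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, L, _ = x.shape
            q = self.q_proj(x).view(b, L, self.n_q_heads, self.head_dim).transpose(1, 2)
            k = self.k_proj(x).view(b, L, self.n_kv_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(b, L, self.n_kv_heads, self.head_dim).transpose(1, 2)
            # GQA-style sharing: repeat K/V heads so each query head has a partner.
            repeat = self.n_q_heads // self.n_kv_heads
            k = k.repeat_interleave(repeat, dim=1)
            v = v.repeat_interleave(repeat, dim=1)
            out = F.scaled_dot_product_attention(q, k, v)  # (b, n_q_heads, L, head_dim)
            out = out.transpose(1, 2).reshape(b, L, self.n_q_heads * self.head_dim)
            return self.o_proj(out)

For example, with n_heads = 16 and n_q_heads = 8, the quadratic-in-L attention-score work is roughly halved relative to MHA, which is the computation-bound saving the abstract refers to; the hyperparameter values here are arbitrary and chosen only for illustration.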