Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

Abstract

The Transformer architecture, built on the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the computational cost of MHA grows quadratically with sequence length, which is a significant barrier to scaling, particularly for long-context applications. Prevailing solutions such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) share Key and Value projections, which effectively addresses the memory bandwidth bottleneck that dominates autoregressive inference latency. While highly successful, these methods do not reduce the number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path: instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in Query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) show that SQA can improve throughput by up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, and these results suggest its potential as a powerful tool for building more efficient and scalable models.
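As a rough illustration of the FLOPs argument in the abstract, the sketch below counts the operations in the two quadratic matrix products of attention (the score computation and the weighted sum over Values) as a function of the number of Query heads. The cost model, the function name `attention_score_flops`, and the head counts and dimensions are all assumptions chosen for illustration, not figures from the paper; softmax, projections, and the rest of the model are ignored.

```python
def attention_score_flops(seq_len: int, num_query_heads: int, head_dim: int) -> int:
    """Approximate FLOPs for the two N x N matrix products in attention.

    Assumed cost model: each product costs about 2 * N * N * d_head FLOPs
    per Query head (one multiply and one add per term). Softmax and the
    Q/K/V/output projections are deliberately left out.
    """
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return num_query_heads * per_head


# Hypothetical configuration, not taken from the paper's experiments.
seq_len, head_dim = 32_768, 64
baseline = attention_score_flops(seq_len, num_query_heads=32, head_dim=head_dim)
reduced = attention_score_flops(seq_len, num_query_heads=8, head_dim=head_dim)

# Under this model the ratio depends only on the Query head counts.
flops_ratio = baseline / reduced
print(f"baseline / reduced FLOPs = {flops_ratio}")
```

Under this simplified model the ratio is just the ratio of Query head counts, independent of sequence length. Measured throughput gains would be expected to be smaller than this ideal ratio, since the parts of the model that are not counted here do not get cheaper.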

