Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

Abstract

The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length is a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), share Key and Value projections and thereby address the memory bandwidth bottleneck that dominates autoregressive inference latency. While highly successful, these methods do not reduce the number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
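
The proportionality claim can be illustrated with a rough FLOP count. The sketch below is not taken from the paper: it assumes the common estimate that each query head costs about 2 · N² · d FLOPs for the score matrix and another 2 · N² · d for the weighted sum over values, and it ignores softmax, projections, and masking. The function names and the head counts (32 versus 8) are hypothetical values chosen for illustration.

```python
def attention_score_flops(seq_len: int, num_query_heads: int, head_dim: int) -> int:
    """Approximate FLOPs for the score and value-aggregation matmuls.

    Assumption (not from the paper): each of the two matmuls costs about
    2 * seq_len^2 * head_dim FLOPs per query head; softmax is ignored.
    """
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return num_query_heads * per_head


def theoretical_reduction(baseline_heads: int, reduced_query_heads: int) -> float:
    """Ratio of baseline FLOPs to FLOPs with fewer query heads."""
    return baseline_heads / reduced_query_heads


if __name__ == "__main__":
    seq_len, head_dim = 32_768, 64  # hypothetical configuration
    baseline = attention_score_flops(seq_len, 32, head_dim)
    reduced = attention_score_flops(seq_len, 8, head_dim)
    # The ratio depends only on the head counts, not on seq_len or head_dim.
    print(f"baseline: {baseline:.3e} FLOPs, reduced: {reduced:.3e} FLOPs")
    print(f"ratio: {baseline / reduced:.1f}")
```

Under these assumptions the ratio equals the ratio of query head counts. This is a theoretical count for the attention matmuls only, so it should not be read as a prediction of the measured end-to-end throughput figures reported above.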
