Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

Abstract

The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length is a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), share Key and Value projections across heads and thereby effectively address the memory bandwidth bottleneck that dominates autoregressive inference latency. While highly successful, these methods do not reduce the number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing the number of Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k to 200k tokens) demonstrate that SQA can achieve throughput improvements of up to 3x in compute-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
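
As a rough illustration of why the score computation scales with the number of query heads, the sketch below counts approximate FLOPs for the two attention matrix products (QK^T and the weighted sum over V). The function name, the head counts (32 and 8), the head dimension (64) and the sequence length are hypothetical values chosen for the arithmetic, not configurations taken from the paper, and the count ignores projections, softmax and any kernel-level effects.

```python
def attention_score_flops(num_query_heads: int, seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for QK^T plus the attention-weighted sum over V.

    Counts one layer and one sequence. The number of Key/Value heads does not
    appear: when K/V heads are shared, they are reused across query heads, so
    the number of score matrices still equals the number of query heads.
    """
    # QK^T per head: (N x d) @ (d x N) costs about 2 * N * N * d FLOPs.
    qk = 2 * num_query_heads * seq_len * seq_len * head_dim
    # softmax(QK^T) @ V per head: (N x N) @ (N x d), the same cost.
    av = 2 * num_query_heads * seq_len * seq_len * head_dim
    return qk + av


# Hypothetical configuration, for illustration only.
seq_len = 32_000
head_dim = 64

full_heads = attention_score_flops(32, seq_len, head_dim)     # all 32 query heads
reduced_heads = attention_score_flops(8, seq_len, head_dim)   # query heads cut to 8

# The ratio equals the ratio of query-head counts (32 / 8).
print(full_heads / reduced_heads)  # 4.0
```

This ratio is a theoretical count of the score computation only. Measured throughput gains, such as the up to 3x reported above, also depend on the parts of the model that this count leaves out.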
