Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Abstract

The quadratic complexity of standard attention mechanisms is a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. Hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, but they typically use static computation ratios (i.e., fixed proportions of sparse versus full attention) and do not adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model; the router dynamically assigns each attention head to one of several computation modes. With only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments on three long-context benchmarks with widely used LLMs demonstrate the superiority of our method.
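
To make the routing idea concrete, here is a minimal sketch of per-head mode assignment. It is an illustration under stated assumptions, not the authors' implementation: the linear router, the pooled input feature vector, the zero-logit decision rule, and the choice of a causal sliding window as the sparse mode are all hypothetical, as are the names route_heads, sparsity_ratio, and attention_mask.

```python
FULL, SPARSE = "full", "sparse"


def route_heads(pooled, weights, bias):
    """Pick a computation mode for each attention head.

    pooled  : feature vector summarizing the input (assumed to exist).
    weights : one weight row per head (hypothetical linear router).
    bias    : one bias per head.
    A head uses full attention when its logit is positive, else sparse.
    """
    modes = []
    for w_row, b in zip(weights, bias):
        logit = sum(w * x for w, x in zip(w_row, pooled)) + b
        modes.append(FULL if logit > 0 else SPARSE)
    return modes


def sparsity_ratio(modes):
    """Fraction of heads routed to the sparse mode."""
    return sum(m == SPARSE for m in modes) / len(modes)


def attention_mask(seq_len, mode, window=4):
    """Causal mask as a list of rows of booleans.

    Full heads attend to every earlier token; sparse heads attend only
    to the last `window` tokens (sliding window, chosen for illustration).
    """
    return [
        [j <= i and (mode == FULL or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy router with 4 heads and a 2-dimensional pooled feature.
WEIGHTS = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
BIAS = [0.0, -1.0, 0.0, -1.0]

modes_a = route_heads([2.0, 2.0], WEIGHTS, BIAS)   # every head gets full attention
modes_b = route_heads([0.5, -1.0], WEIGHTS, BIAS)  # only the first head stays full
```

In this toy setup the same router weights yield a sparsity ratio of 0.0 for the first input and 0.75 for the second, which is the sense in which the overall sparsity depends on the input rather than on a fixed ratio.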
