Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Abstract

The quadratic complexity of standard attention is a major scalability bottleneck for large language models (LLMs) in long-context scenarios. Hybrid attention strategies, which combine sparse and full attention within a single model, offer a viable solution, but they typically use static computation ratios (i.e., fixed proportions of sparse versus full attention) and cannot adapt at inference time to the differing sparsity sensitivities of downstream tasks. To address this, we propose Elastic Attention, which lets the model dynamically adjust its overall sparsity based on the input. It does so by integrating a lightweight Attention Router into an existing pretrained model; the router dynamically assigns each attention head to one of the available computation modes. With only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments on three long-context benchmarks with widely used LLMs demonstrate the superiority of our method.
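The abstract does not specify how the Attention Router is built, so the following is only a minimal sketch of the general idea of per-head routing, under stated assumptions: the router sees a mean-pooled summary of the input, uses one logistic unit per head, and chooses between two modes labelled "sparse" and "full". The names `mean_pool`, `route_heads`, `sparsity_ratio`, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
import math
import random


def mean_pool(hidden):
    """Average a list of token vectors into a single summary vector."""
    dim = len(hidden[0])
    return [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]


def route_heads(hidden, router_weights, router_bias, threshold=0.5):
    """Assign each attention head to 'sparse' or 'full' based on the input.

    Assumption: one logistic unit per head over a mean-pooled summary.
    This is an illustration, not the router design used in the paper.
    """
    summary = mean_pool(hidden)
    modes = []
    for weights, bias in zip(router_weights, router_bias):
        logit = sum(w * x for w, x in zip(weights, summary)) + bias
        p_sparse = 1.0 / (1.0 + math.exp(-logit))
        modes.append("sparse" if p_sparse >= threshold else "full")
    return modes


def sparsity_ratio(modes):
    """Fraction of heads routed to sparse attention for this input."""
    return sum(mode == "sparse" for mode in modes) / len(modes)


if __name__ == "__main__":
    rng = random.Random(0)
    num_heads, dim, seq_len = 8, 4, 16
    # Random stand-ins for learned router parameters and token features.
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_heads)]
    bias = [0.0] * num_heads
    tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
    modes = route_heads(tokens, weights, bias)
    print(modes, sparsity_ratio(modes))
```

Because the mode assignment depends on the input summary, different inputs can yield different fractions of sparse heads, which is the sense in which the overall sparsity ratio is adaptive rather than fixed in this sketch.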
