Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Abstract

The quadratic computational complexity of standard attention mechanisms is a severe scalability bottleneck for LLMs in long-context scenarios. Hybrid attention mechanisms that combine Full Attention (FA) and Sparse Attention (SA) offer a potential solution, but existing methods typically rely on static allocation ratios that cannot accommodate the varying retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speedups of up to 2.8× in the prefill stage and up to 2.0× in the decode stage.
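To make the layer-level routing idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The router here is a hypothetical per-layer logistic gate over a single scalar context feature, the sparse pattern is assumed to be a causal sliding window, and all names (`route_layers`, `prefill_cost`, the weights and biases) are illustrative only. It counts how many query-key pairs are scored during prefill when every layer uses FA versus when the router sends some layers to SA.

```python
import math


def causal_full_pairs(n):
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n, window):
    """Query-key pairs scored when each token attends to at most
    `window` most recent tokens (assumed sparse pattern)."""
    return sum(min(i + 1, window) for i in range(n))


def route_layers(context_feature, router_weights, router_biases, threshold=0.5):
    """Hypothetical router: one logistic gate per layer.
    True means the layer runs full attention, False means sparse."""
    decisions = []
    for w, b in zip(router_weights, router_biases):
        gate = 1.0 / (1.0 + math.exp(-(w * context_feature + b)))
        decisions.append(gate >= threshold)
    return decisions


def prefill_cost(n, decisions, window):
    """Total scored pairs across layers for a given FA/SA assignment."""
    fa = causal_full_pairs(n)
    sa = sliding_window_pairs(n, window)
    return sum(fa if use_fa else sa for use_fa in decisions)


# Toy setup: 4 layers and a made-up scalar context feature.
weights = [2.0, -1.0, 0.5, -3.0]
biases = [0.0, 0.0, -1.0, 1.0]
decisions = route_layers(1.0, weights, biases)

all_fa = prefill_cost(1024, [True] * len(weights), window=128)
mixed = prefill_cost(1024, decisions, window=128)
print("per-layer FA decisions:", decisions)
print("scored-pair reduction factor:", all_fa / mixed)
```

Because the decision is made once per layer, every head in a layer follows the same access pattern, which is the property the abstract credits with avoiding head-level load imbalance. The pair count is only a proxy for compute and does not model memory traffic or kernel efficiency.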
