MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Abstract

Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs and ignores their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
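To make the idea of per-head attention spans that scale with input length concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear scaling rule, span = alpha + beta * seq_len, and a causal sliding-window mask. The abstract only states that scaling rules relative to input length exist, so this rule, the names `HeadRule`, `attention_span`, `sliding_window_mask`, and `average_span`, and all numeric values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HeadRule:
    """Hypothetical per-head rule (an assumption): span = alpha + beta * seq_len."""
    alpha: int    # fixed part of the span, in tokens
    beta: float   # growth of the span per input token


def attention_span(rule: HeadRule, seq_len: int) -> int:
    """Number of recent tokens a head may attend to, clamped to [1, seq_len]."""
    span = int(rule.alpha + rule.beta * seq_len)
    return max(1, min(span, seq_len))


def sliding_window_mask(seq_len: int, span: int) -> list[list[bool]]:
    """Causal mask: query i may attend to keys j with i - span < j <= i."""
    return [[i - span < j <= i for j in range(seq_len)] for i in range(seq_len)]


def average_span(rules: list[HeadRule], seq_len: int) -> float:
    """Mean span across heads, the budget a uniform mask would spend on every head."""
    return sum(attention_span(r, seq_len) for r in rules) / len(rules)


# Two illustrative heads: one stays local, one widens as the input grows.
local_head = HeadRule(alpha=4, beta=0.0)
global_head = HeadRule(alpha=2, beta=0.5)

for n in (8, 16):
    print(n, attention_span(local_head, n), attention_span(global_head, n))
```

With these illustrative values, the local head keeps a span of 4 tokens at both lengths, while the other head's span grows from 6 to 10. A uniform mask with the same average span (7 tokens at length 16) would give both heads the same window, which is the contrast the abstract draws.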

