Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Abstract

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering, and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall the critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization, and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selection. Experiments show that Adamas matches the accuracy of full attention with a budget of only 64 tokens, achieves near-lossless performance at 128 tokens, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods, while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
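The selection pipeline named in the abstract (Hadamard transform, bucketization into 2-bit codes, Manhattan-distance scoring, top-k) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the bucket thresholds `(-1.0, 0.0, 1.0)`, the unnormalized transform, the packing layout, and the rule that a smaller distance means a more relevant key are all assumptions made for illustration only.

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def bucketize(vec, thresholds=(-1.0, 0.0, 1.0)):
    """Map each value to one of four levels (0..3), so each code fits in 2 bits.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    lo, mid, hi = thresholds
    return [int(x >= lo) + int(x >= mid) + int(x >= hi) for x in vec]


def pack_2bit(codes):
    """Pack four 2-bit codes per byte (assumed layout: lowest bits first)."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def manhattan(a, b):
    """L1 distance between two equal-length code vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_top_k(query, keys, k):
    """Return indices of the k keys whose codes are closest to the query's codes.

    Assumption: a smaller Manhattan distance between codes means a more relevant key.
    """
    q_code = bucketize(fwht(query))
    dists = [manhattan(q_code, bucketize(fwht(key))) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: (dists[i], i))
    return sorted(order[:k])


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0, 0.0]
    keys = [[1.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(select_top_k(query, keys, k=2))
```

In a real decoder the key codes would presumably be computed once and stored alongside the KV cache, so that only the query is transformed at each step and full attention runs over the selected keys alone; the sketch recomputes them per call purely for brevity.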
