TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, RoPE rotates queries according to their position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a phenomenon we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by the centers to score keys according to their positions, and we also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or a 10.7x reduction in KV memory, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory errors with Full Attention.
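To make the distance-preference claim concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard RoPE formulation in which adjacent dimension pairs are rotated by an angle proportional to position, with an assumed frequency base of 10000. Under that assumption, the attention logit between a fixed pre-RoPE query center and key center depends only on the query-key distance and can be written as a sum of cosines and sines, one pair per RoPE frequency. All function names here are hypothetical, the centers are random placeholders rather than measured values, and the sketch omits the Q/K norm signal and any cache eviction policy.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # One rotation frequency per 2-D pair (assumed standard RoPE schedule).
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, pos, freqs):
    # Treat each pair (x[2i], x[2i+1]) as a complex number and rotate it
    # by the angle pos * freqs[i].
    rotated = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * freqs)
    out = np.empty_like(x, dtype=float)
    out[0::2] = rotated.real
    out[1::2] = rotated.imag
    return out

def series_coefficients(q_center, k_center):
    # Per-frequency coefficients of the trigonometric series:
    # a_i multiplies cos(distance * freq_i), b_i multiplies -sin(distance * freq_i).
    qc = q_center[0::2] + 1j * q_center[1::2]
    kc = k_center[0::2] + 1j * k_center[1::2]
    z = qc * np.conj(kc)
    return z.real, z.imag

def distance_preference(q_center, k_center, distances, freqs):
    # Logit as a function of distance = query position - key position.
    a, b = series_coefficients(q_center, k_center)
    angles = np.outer(distances, freqs)  # shape: (num_distances, num_pairs)
    return np.cos(angles) @ a - np.sin(angles) @ b

# Illustration with placeholder centers (random, not measured from a model).
rng = np.random.default_rng(0)
head_dim = 64
freqs = rope_frequencies(head_dim)
q_center = rng.normal(size=head_dim)
k_center = rng.normal(size=head_dim)

# Score every candidate distance from the centers alone, without
# materializing any rotated query.
distances = np.arange(0, 1024)
preference = distance_preference(q_center, k_center, distances, freqs)
preferred_distance = int(distances[np.argmax(preference)])
```

The series is evaluated from the two centers only, so a position-dependent score for every cached key can be computed without relying on a handful of recent rotated queries. How TriAttention combines this score with Q/K norms is not specified in the abstract and is left out here.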