**TriAttention: Efficient Long Reasoning with Trigonometric KV Compression**
April 6, 2026
Authors: Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen
cs.AI
Abstract
Extended reasoning in large language models (LLMs) creates a severe KV-cache memory bottleneck. Leading KV-cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, because queries rotate with position under RoPE, very few queries are representative, which leads to poor top-key selection and unstable reasoning. To address this, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers that remain stable across positions -- a phenomenon we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific relative distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which leverages these centers to estimate key importance: via the trigonometric series, the distance preference characterized by the centers scores keys according to their positions, and Q/K norms serve as an auxiliary importance signal. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention also enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory failures with Full Attention.
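The position-based scoring the abstract describes follows from the standard RoPE identity: for fixed pre-RoPE centers c_q and c_k, the post-RoPE dot product at relative distance Δ expands into a trigonometric series Σ_i [a_i cos(θ_i Δ) + b_i sin(θ_i Δ)], whose coefficients depend only on the centers. Below is a minimal NumPy sketch of this idea; the function names and the exact norm-weighted combination are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/d) for i = 0, 1, ..., d/2 - 1
    return base ** (-np.arange(0, d, 2) / d)

def distance_preference(c_q, c_k, deltas, base=10000.0):
    """Trigonometric-series score for keys at relative distances `deltas`,
    computed from the pre-RoPE Q/K centers c_q, c_k (illustrative form).

    Uses the RoPE identity: q^T R(delta) k
      = sum_i (q_2i k_2i + q_2i+1 k_2i+1) cos(theta_i delta)
            + (q_2i k_2i+1 - q_2i+1 k_2i) sin(theta_i delta)
    """
    theta = rope_freqs(c_q.shape[0], base)        # (d/2,)
    q0, q1 = c_q[0::2], c_q[1::2]                 # even/odd coordinate pairs
    k0, k1 = c_k[0::2], c_k[1::2]
    a = q0 * k0 + q1 * k1                         # cosine coefficients
    b = q0 * k1 - q1 * k0                         # sine coefficients
    ang = np.outer(deltas, theta)                 # (n, d/2)
    return np.cos(ang) @ a + np.sin(ang) @ b      # (n,)

def score_keys(c_q, c_k, key_norms, q_pos, key_pos):
    # Position-driven distance preference, modulated by per-key norms
    # as the auxiliary importance signal mentioned in the abstract.
    pref = distance_preference(c_q, c_k, q_pos - key_pos)
    return pref * key_norms
```

A KV-cache compressor would rank cached keys by `score_keys` and keep only the top-scoring entries, instead of re-running post-RoPE attention with recent queries. Note that when c_q = c_k, all sine coefficients vanish and the score is maximized at Δ = 0, reproducing the nearest-key preference the abstract mentions.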