TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
April 6, 2026
Authors: Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen
cs.AI
Abstract
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so few queries are representative, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- a phenomenon we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific relative distances (e.g., the nearest keys), with the concentration centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance from these centers. Via the trigonometric series, TriAttention scores keys by position according to the distance preference encoded by the centers, and additionally uses Q/K norms as a signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half that accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where Full Attention would run out of memory at long context lengths.
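The positional-scoring idea described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation; all function names are hypothetical, and it only assumes the standard RoPE construction: because RoPE rotates each even/odd coordinate pair of Q and K by a position-dependent angle, the dot product between a fixed pre-RoPE query center and a fixed pre-RoPE key center reduces exactly to a trigonometric series in the relative distance d, which can then be evaluated cheaply for every key position:

```python
import numpy as np

def rope_freqs(head_dim, base=10000.0):
    # Standard RoPE frequencies theta_i = base^(-2i/head_dim), one per coordinate pair.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def distance_preference(q_center, k_center, distances, theta):
    # Trigonometric series: the RoPE dot product of the fixed pre-RoPE
    # concentration centers, as a function of relative distance d.
    # For each pair i:  a_i*cos(d*theta_i) + b_i*sin(d*theta_i)
    qe, qo = q_center[0::2], q_center[1::2]   # even/odd coordinates
    ke, ko = k_center[0::2], k_center[1::2]
    a = qe * ke + qo * ko                     # cosine coefficients
    b = qe * ko - qo * ke                     # sine coefficients
    ang = distances[:, None] * theta[None, :] # (num_distances, head_dim/2)
    return (a * np.cos(ang) + b * np.sin(ang)).sum(axis=-1)

def key_importance(positional_score, key_norms):
    # Hypothetical combination: positional preference from the centers,
    # modulated by per-key norms as an auxiliary magnitude signal.
    return positional_score * key_norms
```

At distance d = 0 the series collapses to the plain dot product of the two centers, and for identical centers the sine coefficients vanish, so the score peaks at the nearest keys -- matching the abstract's observation that concentration induces a preference for specific distances.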