TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Abstract

Extended reasoning in large language models (LLMs) creates a severe KV cache memory bottleneck. Leading KV cache compression methods estimate KV importance from the attention scores of recent post-RoPE queries. However, RoPE rotates queries according to their position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, which we refer to as Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and we also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where the long context would otherwise cause out-of-memory errors with Full Attention.
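The link between fixed pre-RoPE centers and a distance preference can be checked numerically. The sketch below is not the authors' implementation or the TriAttention scoring rule. It only illustrates the standard RoPE identity that the post-RoPE logit between two fixed vectors depends solely on their relative distance and equals a sum of cosines in that distance. It assumes the common RoPE convention of rotating adjacent dimension pairs with base 10000. The function names and the random vectors standing in for the Q and K centers are hypothetical.

```python
import cmath
import math
import random


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies, one per 2-D pair of dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, pos, freqs):
    """Rotate each (even, odd) pair of `vec` by pos * freq.

    Each pair is returned as one complex number.
    """
    return [
        complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * pos * f)
        for i, f in enumerate(freqs)
    ]


def rope_logit(q, k, q_pos, k_pos, freqs):
    """Dot product of q and k after RoPE at their respective positions."""
    q_rot = apply_rope(q, q_pos, freqs)
    k_rot = apply_rope(k, k_pos, freqs)
    # Re(a * conj(b)) equals the real dot product of the two 2-D pairs.
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


def trig_series_logit(q_center, k_center, distance, freqs):
    """Same logit written as a trigonometric series in the distance."""
    total = 0.0
    for i, f in enumerate(freqs):
        qc = complex(q_center[2 * i], q_center[2 * i + 1])
        kc = complex(k_center[2 * i], k_center[2 * i + 1])
        amplitude = abs(qc) * abs(kc)             # set by the Q/K norms
        phase = cmath.phase(qc) - cmath.phase(kc)  # set by the center directions
        total += amplitude * math.cos(f * distance + phase)
    return total


# Hypothetical centers: random vectors stand in for the measured Q/K centers.
random.seed(0)
dim = 8
freqs = rope_frequencies(dim)
q_center = [random.gauss(0.0, 1.0) for _ in range(dim)]
k_center = [random.gauss(0.0, 1.0) for _ in range(dim)]

for distance in (0, 1, 10, 100):
    direct = rope_logit(q_center, k_center, 500, 500 - distance, freqs)
    series = trig_series_logit(q_center, k_center, distance, freqs)
    print(distance, round(direct, 6), round(series, 6))
```

The two printed columns agree for every distance, and the series depends only on the query-to-key distance. Under these assumptions, the amplitudes come from the pairwise norms and the phases from the center directions, which is why concentrated centers fix which distances receive larger logits.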
