TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, RoPE rotates queries according to their position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by the centers to score keys according to their positions, and we also use Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches the reasoning accuracy of Full Attention while achieving 2.5x higher throughput or a 10.7x reduction in KV memory, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory errors under Full Attention.
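The link between fixed Q/K centers and a trigonometric series follows from a standard property of RoPE: if a query at position m and a key at position n are both replaced by position-independent vectors, their post-RoPE dot product depends only on the distance m - n and can be written as a sum of cosines, one per rotary frequency. The sketch below checks this identity numerically. It is an illustration under stated assumptions, not the paper's scoring rule: the function names, the interleaved pairing of dimensions, the base of 10000, and the top-k retention step in `keep_top_keys` are all hypothetical choices made for this example, and the Q/K norm signal mentioned in the abstract is not modeled.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/d) (base is an assumption)."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def to_complex(x):
    """View interleaved pairs (x0, x1), (x2, x3), ... as complex numbers."""
    return x[0::2] + 1j * x[1::2]

def rope(x, pos, theta):
    """Rotate each 2D pair of x by the angle pos * theta_i."""
    rotated = to_complex(x) * np.exp(1j * pos * theta)
    out = np.empty_like(x)
    out[0::2] = rotated.real
    out[1::2] = rotated.imag
    return out

def trig_series_logit(q_center, k_center, delta, theta):
    """Logit as a function of distance delta = query_pos - key_pos:
    sum_i |z_i| * cos(delta * theta_i + arg z_i), with z_i = q_i * conj(k_i)."""
    z = to_complex(q_center) * np.conj(to_complex(k_center))
    angles = np.multiply.outer(delta, theta) + np.angle(z)
    return np.sum(np.abs(z) * np.cos(angles), axis=-1)

def keep_top_keys(q_center, k_center, key_positions, query_pos, theta, budget):
    """Hypothetical retention rule: keep the `budget` cached keys whose
    distance to the query scores highest under the series."""
    scores = trig_series_logit(q_center, k_center, query_pos - key_positions, theta)
    return np.sort(np.argsort(scores)[-budget:])

# Numerical check of the identity with random stand-ins for the centers.
rng = np.random.default_rng(0)
d = 8
theta = rope_frequencies(d)
q_center, k_center = rng.normal(size=d), rng.normal(size=d)
m, n = 37, 12
direct = rope(q_center, m, theta) @ rope(k_center, n, theta)
series = trig_series_logit(q_center, k_center, m - n, theta)
print(np.isclose(direct, series))
```

Because the series depends on positions only through the distance, it can be evaluated for every cached key without any post-RoPE query, which is what makes a position-based score possible in this simplified setting.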
