Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether its exponentially decaying memory can also improve existing query-aware sparse inference methods. Using the representative methods Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across a range of sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, for which we continue pretraining for 10B tokens with the added memory module. Finally, we propose two hypotheses for why this memory module benefits query-aware sparse inference, and we design targeted experiments to support them.