Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether its exponentially decaying memory can also improve existing query-aware sparse inference methods. Using three representative methods (Quest, MoBA, and SnapKV), we show that RAT+ consistently improves accuracy over standard attention across a range of sparsity budgets on eight needle-in-a-haystack tasks. We validate these gains both on the checkpoints released with the RAT+ paper and on OLMo2-7B, which we continue pretraining for 10B tokens with the memory module added. Finally, we propose two hypotheses for why this memory module benefits query-aware sparse inference, and we design targeted experiments to support them.