Augmenting Attention with Exponentially Decaying Memory to Improve Query-Aware KV Sparsity

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using the representative methods Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparsity budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining for 10B tokens with the added memory module. Finally, we propose two hypotheses for why this memory module benefits query-aware sparse inference and design targeted experiments to test them.