Improving Query-Aware KV Sparsity by Augmenting Attention with Exponentially Decaying Memory

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether the exponentially decaying memory of this backbone can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
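
For readers unfamiliar with query-aware sparse inference, the following is a minimal NumPy sketch of page selection in the style of Quest: each page of the KV cache is summarized by the elementwise minimum and maximum of its keys, the query is scored against an upper bound on the dot product with any key in the page, and only the highest-scoring pages are attended to. The function names (`select_pages`, `sparse_attention`), the page size, and the budget are illustrative assumptions; this is not the implementation evaluated in the paper and it does not include the RAT+ memory module.

```python
import numpy as np

def select_pages(q, keys, page_size, budget_pages):
    """Pick the pages whose upper bound on q.k is largest (hypothetical helper)."""
    n_pages = (len(keys) + page_size - 1) // page_size
    scores = np.empty(n_pages)
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        k_min = page.min(axis=0)  # elementwise minimum key of the page
        k_max = page.max(axis=0)  # elementwise maximum key of the page
        # For each dimension, the largest possible q_i * k_i over the page,
        # summed over dimensions: an upper bound on q.k for any key in the page.
        scores[p] = np.maximum(q * k_min, q * k_max).sum()
    top = np.argsort(-scores)[:budget_pages]
    return np.sort(top), scores

def sparse_attention(q, keys, values, page_size, budget_pages):
    """Softmax attention restricted to the tokens of the selected pages."""
    pages, _ = select_pages(q, keys, page_size, budget_pages)
    idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, len(keys)))
        for p in pages
    ])
    logits = keys[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[idx], idx
```

The sparse budget in this sketch is `budget_pages * page_size` tokens; a needle-in-a-haystack query succeeds only if the page containing the needle survives selection, which is the behavior the accuracy comparisons across budgets measure.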