Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether its exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparsity budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.