Context Memory for Efficient Long-Context Generation

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. Prefix-augmented inference is effective, but it has two structural limitations: (i) the prefix's influence fades as generation proceeds, and (ii) the cost of attending to the prefix grows linearly with its length. Existing approaches either compress the prefix while keeping it in attention, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference time, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at memory budgets from 1K to 8K while reducing attention latency by 1.36x at 8K. On the NBA benchmark, it surpasses full-attention RAG while using only 20% of its memory footprint.
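The abstract does not say how attention states are stored, looked up, or combined, so the following is only a minimal sketch of one standard identity that makes precomputed attention states reusable. It assumes a "state" is the pair (attention output, log-sum-exp of the scores) for a single query. Under that assumption, a state computed over the prefix can be merged exactly with a state computed over the remaining tokens, without attending to the prefix again. The function names and the toy dimensions are hypothetical, and the sketch uses the same query for both states; how the proposed method handles queries not seen when the memory was built is not covered here.

```python
import math
import random


def attention_state(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge_states(state_a, state_b):
    """Combine states computed over two disjoint key sets for the same query."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)  # relative softmax mass of each key set
    w_b = math.exp(lse_b - m)
    out = [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]
    return out, m + math.log(w_a + w_b)


def random_vectors(rng, n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 8  # toy head dimension (assumption for illustration)
prefix_k, prefix_v = random_vectors(rng, 32, dim), random_vectors(rng, 32, dim)
local_k, local_v = random_vectors(rng, 4, dim), random_vectors(rng, 4, dim)
q = random_vectors(rng, 1, dim)[0]

# Hypothetical memory entry: prefix attention state computed ahead of time.
memory_entry = attention_state(q, prefix_k, prefix_v)

# At generation time, attend only to the short local context, then merge.
local_state = attention_state(q, local_k, local_v)
merged_out, merged_lse = merge_states(memory_entry, local_state)

# Reference: full attention over prefix and local context together.
full_out, full_lse = attention_state(q, prefix_k + local_k, prefix_v + local_v)
max_error = max(abs(a - b) for a, b in zip(merged_out, full_out))
```

In this toy setting the merged result matches full attention up to floating-point error, while the generation-time work touches only the 4 local keys instead of all 36. The merge is exact only because the stored state was computed for the same query; a lookup-based memory serving new queries would need some additional matching or approximation step.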