Context Memorization for Efficient Long-Context Generation

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it has two structural limitations: (i) the prefix's influence fades as generation proceeds, and (ii) the cost of attending to the prefix scales linearly with its length. Existing approaches either compress the prefix but keep it in attention, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference time, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at memory budgets from 1K to 8K while reducing attention latency by 1.36x at 8K. On the NBA benchmark, it surpasses full-attention RAG while using only 20% of its memory footprint.
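The abstract does not say how the precomputed attention states are stored or looked up, so the following is only a minimal sketch of a standard identity that makes such a memory plausible, not the authors' method. For one query, softmax attention over a prefix can be summarized by an output vector and a log-sum-exp normalizer, and that pair can later be merged exactly with attention over the remaining context. The function names `attention_state` and `merge_states`, the toy dimensions, and the random data are all hypothetical.

```python
import math
import random


def attention_state(q, keys, values):
    """Softmax attention of one query over keys/values.

    Returns (output vector, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge_states(state_a, state_b):
    """Combine two partial attention states for the same query exactly."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    merged = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


def random_vectors(n, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 8  # toy head dimension (assumption)
prefix_k, prefix_v = random_vectors(16, dim, rng), random_vectors(16, dim, rng)
local_k, local_v = random_vectors(4, dim, rng), random_vectors(4, dim, rng)
q = random_vectors(1, dim, rng)[0]

# The prefix state is the part that could, in principle, be computed ahead of time.
prefix_state = attention_state(q, prefix_k, prefix_v)
local_state = attention_state(q, local_k, local_v)
merged_out, merged_lse = merge_states(prefix_state, local_state)

# Reference: attend over the concatenated prefix and local context directly.
full_out, full_lse = attention_state(q, prefix_k + local_k, prefix_v + local_v)
max_err = max(abs(a - b) for a, b in zip(merged_out, full_out))
```

Here `max_err` is at floating-point rounding level, since the merge is algebraically exact. The sketch assumes the query is known when the prefix state is computed. How a lookup-based memory handles queries that only appear at generation time is the part specific to the proposed method and is not modeled here.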