
GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

March 14, 2026
Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev
cs.AI

Abstract

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
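The write operation described in the abstract — gradient descent on a small memory state against a reconstruction loss, with the model frozen — can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the frozen "model" is a single linear read-out, and the names (`W_read`, `write_to_memory`) and sizes are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_mem, n_ctx = 16, 4, 8          # toy sizes: 4 memory tokens, 8 context tokens
W_read = rng.normal(size=(n_ctx, n_mem)) / np.sqrt(n_mem)  # frozen "model" weights

context = rng.normal(size=(n_ctx, d_model))  # the context to be written into memory

def reconstruct(memory):
    # Frozen read-out: predicts the context tokens from the memory tokens.
    return W_read @ memory

def write_to_memory(context, steps=200, lr=0.1):
    # Per-sample test-time optimization: gradient descent on the memory
    # tokens only, driven by a self-supervised reconstruction loss.
    memory = np.zeros((n_mem, d_model))
    for _ in range(steps):
        err = reconstruct(memory) - context   # reconstruction error, (n_ctx, d_model)
        grad = W_read.T @ err / n_ctx         # gradient of mean-squared loss w.r.t. memory
        memory -= lr * grad                   # iterative error correction; W_read stays fixed
    return memory

mem = write_to_memory(context)
loss0 = np.mean((reconstruct(np.zeros_like(mem)) - context) ** 2)  # before writing
loss = np.mean((reconstruct(mem) - context) ** 2)                   # after writing
print(f"reconstruction loss before: {loss0:.3f}, after: {loss:.3f}")
```

The sketch shows why extra gradient steps buy capacity: each step corrects the residual reconstruction error, unlike a single forward write. A forward-only baseline would correspond to computing `memory` in one fixed pass with no loss feedback.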