

Fast-weight Product Key Memory

January 2, 2026
Authors: Tianyu Zhao, Llion Jones
cs.AI

Abstract

Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
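The mechanism the abstract describes — a product-key memory whose value table is updated by local gradient descent at inference time — can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `FwPKM`, the table sizes, the per-query (rather than chunk-level) update, and the softmax read are all illustrative assumptions. It shows the two PKM ingredients (split queries scored against two small sub-key sets, with the Cartesian product of top-k indices addressing a large value table) and a fast-weight write as one gradient step on the read-out error.

```python
import numpy as np

class FwPKM:
    """Minimal product-key memory with fast-weight value updates.

    Hypothetical sketch: the query is split in half, each half scores a
    small set of sub-keys, and the Cartesian product of the two top-k
    index sets addresses a value table of size n_sub**2.
    """

    def __init__(self, n_sub=8, d_half=4, d_val=4, top_k=2, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.sub_keys1 = rng.standard_normal((n_sub, d_half))
        self.sub_keys2 = rng.standard_normal((n_sub, d_half))
        self.values = np.zeros((n_sub * n_sub, d_val))  # the fast weights
        self.n_sub, self.top_k, self.lr = n_sub, top_k, lr

    def _address(self, query):
        q1, q2 = np.split(query, 2)
        s1, s2 = self.sub_keys1 @ q1, self.sub_keys2 @ q2
        i1 = np.argsort(s1)[-self.top_k:]           # top-k per half
        i2 = np.argsort(s2)[-self.top_k:]
        # Cartesian product of candidate halves -> flat slot ids + scores
        slots = (i1[:, None] * self.n_sub + i2[None, :]).ravel()
        scores = (s1[i1][:, None] + s2[i2][None, :]).ravel()
        return slots, scores

    def read(self, query):
        slots, scores = self._address(query)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values[slots]

    def write(self, query, target):
        """One local gradient step on 0.5*||read(query) - target||^2,
        touching only the addressed value slots."""
        slots, scores = self._address(query)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        grad = w @ self.values[slots] - target      # d(loss)/d(pred)
        self.values[slots] -= self.lr * np.outer(w, grad)
```

A usage sketch: after a few `write` steps on a key-value pair, `read` recovers the stored value, which is the episodic memorize-then-retrieve behavior the abstract attributes to FwPKM.

```python
mem = FwPKM()
q = np.ones(8)
t = np.array([1.0, -1.0, 0.0, 2.0])
for _ in range(50):
    mem.write(q, t)
print(np.round(mem.read(q), 2))  # close to t
```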