Pretraining with hierarchical memories: separating long-tail and common knowledge
September 29, 2025
Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
cs.AI
Abstract
The impressive performance gains of modern language models currently rely on
scaling parameters: larger models store more world knowledge and reason better.
Yet compressing all world knowledge into parameters is unnecessary, as only a
fraction is used per prompt, and impractical for edge devices with limited
inference-time memory and compute. We address this shortcoming with a
memory-augmented architecture and a pretraining strategy aligned with existing
hardware paradigms. We introduce small language models that access large
hierarchical parametric memory banks encoding world knowledge. During
pretraining and inference, we fetch a small, context-dependent memory block and
add it to the model. Our pretraining learns to store long-tail world knowledge
in the memory parameters, while the small language model acts as an anchor
capturing common knowledge and general reasoning abilities. Through
trillion-token-scale experiments, we show significant gains: a 160M-parameter
model augmented with an 18M-parameter memory fetched from a 4.6B-parameter
memory bank obtains comparable performance to a regular model with more than
2x the
parameters. Through extensive experiments, we study the optimal type and size
of parametric memories in transformers, scaling them to over 21B parameters. We
find that our proposed hierarchical feed-forward memories work robustly across
transformer architectures, whether added during pretraining or post-hoc.
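To make the fetch-and-add idea concrete, the following is a minimal, hypothetical PyTorch sketch of a feed-forward memory block selected from a larger bank and added residually to a transformer's hidden states. It assumes a flat (non-hierarchical) bank and an externally supplied block id for brevity; the names FeedForwardMemory, MemoryBank, and fetch, and the sizes used, are illustrative assumptions rather than the paper's actual interface or retrieval mechanism.

import torch
import torch.nn as nn

class FeedForwardMemory(nn.Module):
    # A small two-layer feed-forward block playing the role of one memory slice.
    def __init__(self, d_model: int, d_mem: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_mem)
        self.down = nn.Linear(d_mem, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

class MemoryBank(nn.Module):
    # A large bank of memory blocks; only a small, context-dependent block is fetched.
    def __init__(self, num_blocks: int, d_model: int, d_mem: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [FeedForwardMemory(d_model, d_mem) for _ in range(num_blocks)]
        )

    def fetch(self, block_id: int) -> FeedForwardMemory:
        # In the paper the block is chosen from the context; here it is indexed
        # by an externally supplied id purely for illustration.
        return self.blocks[block_id]

# Usage sketch: augment the base model's hidden states with the fetched block.
d_model, d_mem = 512, 128
bank = MemoryBank(num_blocks=256, d_model=d_model, d_mem=d_mem)
memory = bank.fetch(block_id=42)         # small block fetched from the large bank
hidden = torch.randn(1, 16, d_model)     # hidden states of the small base model
augmented = hidden + memory(hidden)      # residual addition of the memory output

Because the fetched block is itself a feed-forward module, it can be attached to an existing transformer's feed-forward path without changing the base architecture, which is consistent with the abstract's observation that such memories can be added either during pretraining or post-hoc.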