Pretraining with hierarchical memories: separating long-tail and common knowledge
September 29, 2025
Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
cs.AI
Abstract
The impressive performance gains of modern language models currently rely on
scaling parameters: larger models store more world knowledge and reason better.
Yet compressing all world knowledge into parameters is unnecessary, as only a
fraction is used per prompt, and impractical for edge devices with limited
inference-time memory and compute. We address this shortcoming with a
memory-augmented architecture and a pretraining strategy aligned with existing
hardware paradigms. We introduce small language models that access large
hierarchical parametric memory banks encoding world knowledge. During
pretraining and inference, we fetch a small, context-dependent memory block and
add it to the model. Our pretraining learns to store long-tail world knowledge
in the memory parameters, while the small language model acts as an anchor
capturing common knowledge and general reasoning abilities. Through
trillion-token-scale experiments, we show significant gains: a 160M-parameter
model augmented with an 18M-parameter memory fetched from a 4.6B-parameter
memory bank performs comparably to a regular model with more than 2x the
parameters. Through extensive experiments, we study the optimal type and size
of parametric memories in transformers, scaling them to over 21B parameters. We
find that our proposed hierarchical feed-forward memories work robustly across
transformer architectures, whether added during pretraining or post-hoc.
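
As a rough illustration only (not the authors' released code), the PyTorch-style sketch below shows one way a small transformer's feed-forward layer could be augmented with a context-dependent memory block fetched from a large parametric bank. All class and parameter names (FeedForwardMemoryBank, MemoryAugmentedFFN, d_mem, k, etc.) are hypothetical, and a simple top-k key match stands in for the paper's hierarchical fetch rule, whose exact form is not specified in the abstract.

```python
# Minimal sketch under stated assumptions; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardMemoryBank(nn.Module):
    """A large bank of small feed-forward memory slots; only a few are fetched per context."""

    def __init__(self, num_slots: int, d_model: int, d_mem: int):
        super().__init__()
        # Each slot holds a key plus up/down projection weights (hypothetical layout).
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.up = nn.Parameter(torch.randn(num_slots, d_model, d_mem) * 0.02)
        self.down = nn.Parameter(torch.randn(num_slots, d_mem, d_model) * 0.02)

    def fetch(self, context: torch.Tensor, k: int = 4):
        """Select the k slots whose keys best match a pooled context vector (non-differentiable routing)."""
        scores = context @ self.keys.T               # (batch, num_slots)
        idx = scores.topk(k, dim=-1).indices         # (batch, k)
        return self.up[idx], self.down[idx]          # small, context-dependent memory block


class MemoryAugmentedFFN(nn.Module):
    """Base FFN of the small model plus an additive branch from the fetched memory block."""

    def __init__(self, d_model: int, d_ff: int, bank: FeedForwardMemoryBank):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.bank = bank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); pool over the sequence to form a context query.
        context = x.mean(dim=1)                      # (batch, d_model)
        up, down = self.bank.fetch(context)          # (batch, k, d_model, d_mem), (batch, k, d_mem, d_model)
        # Apply each fetched slot as an extra small FFN and sum its contributions.
        h = torch.einsum("bsd,bkdm->bksm", x, up)
        mem_out = torch.einsum("bksm,bkmd->bsd", F.gelu(h), down)
        return self.base(x) + mem_out                # add the memory branch to the base FFN output
```

In this sketch the bank (e.g. `FeedForwardMemoryBank(num_slots=65536, d_model=512, d_mem=64)`) would be shared across layers and kept in bulk storage, while only the fetched slots need to be resident in device memory at inference time; the paper's hierarchical organization and memory sizes may differ.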