Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
July 9, 2024
Authors: Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
cs.AI
Abstract
Scaling laws with respect to the amount of training data and the number of
parameters allow us to predict the cost-benefit trade-offs of pretraining
language models (LMs) in different configurations. In this paper, we consider
another dimension of scaling: the amount of data available at inference time.
Specifically, we find that increasing the size of the datastore used by a
retrieval-based LM monotonically improves language modeling and several
downstream tasks without obvious saturation, such that a smaller model
augmented with a large datastore outperforms a larger LM-only model on
knowledge-intensive tasks. By plotting compute-optimal scaling curves with
varied datastore, model, and pretraining data sizes, we show that using larger
datastores can significantly improve model performance for the same training
compute budget. We carry out our study by constructing a 1.4 trillion-token
datastore named MassiveDS, which is the largest and the most diverse
open-sourced datastore for retrieval-based LMs to date, and designing an
efficient pipeline for studying datastore scaling in a computationally
accessible manner. Finally, we analyze the effect of improving the retriever,
datastore quality filtering, and other design choices on our observed scaling
trends. Overall, our results show that datastore size should be considered as
an integral part of LM efficiency and performance trade-offs. To facilitate
future research, we open-source our datastore and code at
https://github.com/RulinShao/retrieval-scaling.
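For readers unfamiliar with the setup the abstract describes, the sketch below illustrates the generic retrieve-then-read data flow and a miniature datastore-size sweep. It is purely illustrative: the lexical-overlap retriever, synthetic passages, and helper names (retrieve, build_prompt) are assumptions made here for clarity and do not reflect the MassiveDS pipeline released at the repository above.

```python
# Toy sketch (not the paper's pipeline): a retrieve-then-read loop evaluated at
# several datastore sizes. The lexical-overlap retriever and synthetic passages
# are placeholder assumptions; the paper uses dense retrieval over real corpora.
import random

random.seed(0)

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank passages by word overlap with the query (crude stand-in retriever)."""
    q_tokens = set(query.lower().split())
    ranked = sorted(passages, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Prepend retrieved passages to the query, as in retrieval-augmented LMs."""
    return "\n".join(retrieved) + "\n\nQuestion: " + query

# Synthetic "datastore"; in the paper this is a trillion-token text corpus.
full_datastore = [f"document {i} about topic {i % 50}" for i in range(100_000)]

query = "Tell me about topic 7"
for size in (1_000, 10_000, 100_000):  # sweep datastore size, as in the scaling study
    subsample = random.sample(full_datastore, size)
    top = retrieve(query, subsample, k=3)
    prompt = build_prompt(query, top)
    # In a real experiment, the prompt would be fed to the LM and perplexity or
    # task accuracy recorded per datastore size to trace the scaling curve.
    print(f"datastore size = {size:>7,d} | top passage: {top[0]}")
```

In practice the retriever is typically a dense encoder paired with a nearest-neighbor index rather than word overlap, and the quantity tracked at each datastore size is language-modeling perplexity or downstream task accuracy, which is what the scaling curves in the paper measure.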