Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
July 9, 2024
Authors: Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
cs.AI
Abstract
Scaling laws with respect to the amount of training data and the number of
parameters allow us to predict the cost-benefit trade-offs of pretraining
language models (LMs) in different configurations. In this paper, we consider
another dimension of scaling: the amount of data available at inference time.
Specifically, we find that increasing the size of the datastore used by a
retrieval-based LM monotonically improves language modeling and several
downstream tasks without obvious saturation, such that a smaller model
augmented with a large datastore outperforms a larger LM-only model on
knowledge-intensive tasks. By plotting compute-optimal scaling curves with
varied datastore, model, and pretraining data sizes, we show that using larger
datastores can significantly improve model performance for the same training
compute budget. We carry out our study by constructing a 1.4 trillion-token
datastore named MassiveDS, which is the largest and the most diverse
open-sourced datastore for retrieval-based LMs to date, and designing an
efficient pipeline for studying datastore scaling in a computationally
accessible manner. Finally, we analyze the effect of improving the retriever,
datastore quality filtering, and other design choices on our observed scaling
trends. Overall, our results show that datastore size should be considered as
an integral part of LM efficiency and performance trade-offs. To facilitate
future research, we open-source our datastore and code at
https://github.com/RulinShao/retrieval-scaling.
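For readers unfamiliar with the setup the abstract describes, the sketch below illustrates the generic retrieve-then-read data flow and a miniature datastore-size sweep. It is purely illustrative: the lexical-overlap retriever, synthetic passages, and helper names (retrieve, build_prompt) are assumptions made here for clarity and do not reflect the MassiveDS pipeline released at the repository above.

```python
# Toy sketch (not the paper's pipeline): a retrieve-then-read loop evaluated at
# several datastore sizes. The lexical-overlap retriever and synthetic passages
# are placeholder assumptions; the paper uses dense retrieval over real corpora.
import random

random.seed(0)

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank passages by word overlap with the query (crude stand-in retriever)."""
    q_tokens = set(query.lower().split())
    ranked = sorted(passages, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Prepend retrieved passages to the query, as in retrieval-augmented LMs."""
    return "\n".join(retrieved) + "\n\nQuestion: " + query

# Synthetic "datastore"; in the paper this is a trillion-token text corpus.
full_datastore = [f"document {i} about topic {i % 50}" for i in range(100_000)]

query = "Tell me about topic 7"
for size in (1_000, 10_000, 100_000):  # sweep datastore size, as in the scaling study
    subsample = random.sample(full_datastore, size)
    top = retrieve(query, subsample, k=3)
    prompt = build_prompt(query, top)
    # In a real experiment, the prompt would be fed to the LM and perplexity or
    # task accuracy recorded per datastore size to trace the scaling curve.
    print(f"datastore size = {size:>7,d} | top passage: {top[0]}")
```

In practice the retriever is typically a dense encoder paired with a nearest-neighbor index rather than word overlap, and the quantity tracked at each datastore size is language-modeling perplexity or downstream task accuracy, which is what the scaling curves in the paper measure.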