Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Abstract

Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and most diverse open-sourced datastore for retrieval-based LMs to date, and by designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.
