Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Abstract

Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and most diverse open-sourced datastore for retrieval-based LMs to date, and by designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.
