Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
January 31, 2024
Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
cs.AI
Abstract
Language models have become a critical technology for tackling a wide range of
natural language processing tasks, yet many details about how the
best-performing language models were developed are not reported. In particular,
information about their pretraining corpora is seldom discussed: commercial
language models rarely provide any information about their data; even open
models rarely release datasets they are trained on, or an exact recipe to
reproduce them. As a result, it is challenging to conduct certain threads of
language modeling research, such as understanding how training data impacts
model capabilities and shapes their limitations. To facilitate open research on
language model pretraining, we release Dolma, a three-trillion-token English
corpus, built from a diverse mixture of web content, scientific papers, code,
public-domain books, social media, and encyclopedic materials. In addition, we
open-source our data curation toolkit to enable further experimentation and
reproduction of our work. In this report, we document Dolma, including its
design principles, details about its construction, and a summary of its
contents. We interleave this report with analyses and experimental results from
training language models on intermediate states of Dolma to share what we have
learned about important data curation practices, including the role of content
or quality filters, deduplication, and multi-source mixing. Dolma has been used
to train OLMo, a state-of-the-art, open language model and framework designed
to build and study the science of language modeling.
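The abstract lists content/quality filtering, deduplication, and multi-source mixing as the main curation steps studied. As a purely illustrative aid, the following is a minimal Python sketch of one such step, exact-match document deduplication by content hashing. The function name dedupe_documents and all details are hypothetical assumptions for illustration; this is not the Dolma toolkit's implementation.

# Hypothetical sketch of exact-match document deduplication via content hashing.
# Not the Dolma toolkit's implementation; it only illustrates the general idea
# of the deduplication step mentioned in the abstract.
import hashlib
from typing import Iterable, Iterator

def dedupe_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalized text has not been seen before."""
    seen = set()
    for doc in docs:
        # Normalize whitespace so trivially different copies hash identically.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

if __name__ == "__main__":
    corpus = ["Hello  world.", "Hello world.", "A different document."]
    print(list(dedupe_documents(corpus)))  # keeps 2 of the 3 documents

In practice, large-scale pipelines typically shard the corpus and use approximate methods (e.g., MinHash or Bloom filters) rather than an in-memory set, but the sketch above captures the basic idea of removing repeated documents before training.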