

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

January 31, 2024
Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
cs.AI

Abstract

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release the datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open-source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art, open language model and framework designed to build and study the science of language modeling.
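The abstract highlights deduplication as one of the curation practices examined. As a rough illustration of the simplest form of this step, the sketch below drops exact (whitespace-normalized) duplicate documents from a JSONL shard by hashing their text. It is a minimal, assumption-laden example, not the Dolma toolkit's actual interface; the `text` field and file names are hypothetical.

```python
# Minimal sketch of exact document-level deduplication by content hash.
# NOT the Dolma toolkit's API; the JSONL layout and "text" field are assumed.
import hashlib
import json

def dedupe_jsonl(in_path: str, out_path: str) -> None:
    """Write only documents whose normalized text hash has not been seen."""
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            # Normalize whitespace so trivially reformatted copies collide.
            normalized = " ".join(doc["text"].split())
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # skip exact (whitespace-equivalent) duplicate
            seen.add(digest)
            fout.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    # Hypothetical shard names, for demonstration only.
    dedupe_jsonl("corpus_shard.jsonl", "corpus_shard.deduped.jsonl")
```

In practice, corpus-scale pipelines typically combine hash-based exact matching like this with fuzzy (near-duplicate) methods and paragraph-level passes; this sketch only conveys the basic idea.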