Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Abstract

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed go unreported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data, and even open models rarely release the datasets they are trained on or an exact recipe to reproduce them. As a result, it is challenging to pursue certain threads of language modeling research, such as understanding how training data affects model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open-source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content and quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art open language model and framework designed to build and study the science of language modeling.
