The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has drawn scrutiny over possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text is a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or too low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains, including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models perform competitively with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
