The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
June 5, 2025
Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
cs.AI
Abstract
Large language models (LLMs) are typically trained on enormous quantities of
unlicensed text, a practice that has led to scrutiny due to possible
intellectual property infringement and ethical concerns. Training LLMs on
openly licensed text presents a first step towards addressing these issues, but
prior data collection efforts have yielded datasets too small or low-quality to
produce performant LLMs. To address this gap, we collect, curate, and release
the Common Pile v0.1, an eight terabyte collection of openly licensed text
designed for LLM pretraining. The Common Pile comprises content from 30 sources
that span diverse domains including research papers, code, books,
encyclopedias, educational materials, audio transcripts, and more. Crucially,
we validate our efforts by training two 7 billion parameter LLMs on text from
the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion
tokens, respectively. Both models attain performance competitive with LLMs trained
on unlicensed text under similar computational budgets, such as Llama 1 and 2
7B. In addition to releasing the Common Pile v0.1 itself, we also release the
code used in its creation as well as the training mixture and checkpoints for
the Comma v0.1 models.
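
For readers who want to experiment with the released data, the following is a minimal sketch of streaming examples with the Hugging Face datasets library. The repository identifier and the "text" field name used below are assumptions for illustration, not details confirmed by the abstract; check the actual Common Pile v0.1 release for the correct identifiers.

    # Minimal sketch: stream a few examples from an openly licensed pretraining corpus
    # using the Hugging Face `datasets` library.
    # NOTE: the repository id and the "text" column are assumptions for illustration.
    from datasets import load_dataset

    dataset = load_dataset(
        "common-pile/common_pile_v0.1",  # hypothetical repository id
        split="train",
        streaming=True,  # avoid downloading the full multi-terabyte corpus
    )

    for i, example in enumerate(dataset):
        print(example["text"][:200])  # assumes a "text" field, typical for pretraining corpora
        if i >= 2:
            break

Streaming mode is used here because a corpus of this scale (roughly 8TB of text) is impractical to download in full for quick inspection.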