The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
June 5, 2025
Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
cs.AI
Abstract
Large language models (LLMs) are typically trained on enormous quantities of
unlicensed text, a practice that has led to scrutiny due to possible
intellectual property infringement and ethical concerns. Training LLMs on
openly licensed text presents a first step towards addressing these issues, but
prior data collection efforts have yielded datasets too small or low-quality to
produce performant LLMs. To address this gap, we collect, curate, and release
the Common Pile v0.1, an eight terabyte collection of openly licensed text
designed for LLM pretraining. The Common Pile comprises content from 30 sources
that span diverse domains including research papers, code, books,
encyclopedias, educational materials, audio transcripts, and more. Crucially,
we validate our efforts by training two 7 billion parameter LLMs on text from
the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion
tokens, respectively. Both models attain performance competitive with LLMs trained
on unlicensed text under similar computational budgets, such as Llama 1 and 2
7B. In addition to releasing the Common Pile v0.1 itself, we also release the
code used in its creation as well as the training mixture and checkpoints for
the Comma v0.1 models.
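
For readers who want to experiment with the released data, the following is a minimal sketch of streaming examples with the Hugging Face datasets library. The repository identifier and the "text" field name used below are assumptions for illustration, not details confirmed by the abstract; check the actual Common Pile v0.1 release for the correct identifiers.

    # Minimal sketch: stream a few examples from an openly licensed pretraining corpus
    # using the Hugging Face `datasets` library.
    # NOTE: the repository id and the "text" column are assumptions for illustration.
    from datasets import load_dataset

    dataset = load_dataset(
        "common-pile/common_pile_v0.1",  # hypothetical repository id
        split="train",
        streaming=True,  # avoid downloading the full multi-terabyte corpus
    )

    for i, example in enumerate(dataset):
        print(example["text"][:200])  # assumes a "text" field, typical for pretraining corpora
        if i >= 2:
            break

Streaming mode is used here because a corpus of this scale (roughly 8TB of text) is impractical to download in full for quick inspection.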