
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

June 5, 2025
Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
cs.AI

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
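Since the authors state that the Comma v0.1 checkpoints and the dataset-creation code are publicly released, a short sketch of how one might load a released checkpoint with the Hugging Face Transformers library is given below. The repository identifier is an assumption for illustration only; the actual IDs are whatever the authors publish alongside the paper.

# Minimal sketch (assumed repo ID): loading a Comma v0.1 checkpoint with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "common-pile/comma-v0.1-1t"  # hypothetical identifier for the 1T-token model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Openly licensed text can be used to"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern would apply to the 2T-token checkpoint by swapping in its repository ID.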