The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has drawn scrutiny over possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text is a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or too low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains, including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models perform competitively with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
