The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Abstract

Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.