The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
October 15, 2025
Authors: Lukas Gienapp, Christopher Schröder, Stefan Schweter, Christopher Akiki, Ferdinand Schlatt, Arden Zimmermann, Phillipe Genêt, Martin Potthast
cs.AI
Abstract
Large language model development relies on large-scale training corpora, yet
most contain data of unclear licensing status, limiting the development of
truly open models. This problem is exacerbated for non-English languages, where
openly licensed text remains critically scarce. We introduce the German
Commons, the largest collection of openly licensed German text to date. It
compiles data from 41 sources across seven domains, encompassing legal,
scientific, cultural, political, news, economic, and web text. Through
systematic sourcing from established data providers with verifiable licensing,
it yields 154.56 billion tokens of high-quality text for language model
training. Our processing pipeline implements comprehensive quality filtering,
deduplication, and text formatting fixes, ensuring consistent quality across
heterogeneous text sources. All domain subsets feature licenses of at least
CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and
redistribution. The German Commons therefore addresses the critical gap in
openly licensed German pretraining data and enables the development of truly
open German language models. We also release code for corpus construction and
data filtering tailored to German-language text, rendering the German Commons
fully reproducible and extensible.
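
The abstract describes a processing pipeline built from quality filtering,
deduplication, and text formatting fixes. As a rough illustration of what one
pass of such a pipeline can look like, the following minimal Python sketch
applies Unicode/whitespace normalization, exact content-hash deduplication,
and two simple heuristic quality filters. All thresholds, helper names, and
filter choices here are illustrative assumptions, not the paper's actual
implementation.

# Minimal sketch of a corpus-cleaning pass in the spirit of the abstract:
# normalization, heuristic quality filtering, and exact deduplication.
# Thresholds and helper names are illustrative assumptions only.
import hashlib
import re
import unicodedata

MIN_CHARS = 200          # assumed minimum document length
MIN_ALPHA_RATIO = 0.6    # assumed minimum share of alphabetic characters

def normalize(text: str) -> str:
    """Canonicalize Unicode form and collapse whitespace before hashing."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def passes_quality(text: str) -> bool:
    """Reject very short documents and documents dominated by non-letters."""
    if len(text) < MIN_CHARS:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= MIN_ALPHA_RATIO

def dedup_and_filter(docs):
    """Yield documents that are new (by content hash) and pass the filters."""
    seen = set()
    for doc in docs:
        text = normalize(doc)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen or not passes_quality(text):
            continue
        seen.add(digest)
        yield text

if __name__ == "__main__":
    sample = ["Ein Beispieltext. " * 20, "Ein Beispieltext. " * 20, "123 456"]
    # Exact duplicate and low-quality document are dropped, leaving one text.
    print(len(list(dedup_and_filter(sample))))  # -> 1

A production pipeline over 41 heterogeneous sources would typically layer
near-duplicate detection (e.g., MinHash-based) and language identification on
top of this skeleton; the sketch only shows the basic normalize-filter-dedup
structure.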