KletterMix: Climbing Toward High-Quality German Pretraining Data

Abstract

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.