KletterMix: Climbing Toward High-Quality German Pretraining Data

Abstract

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.