KletterMix: Toward High-Quality German Pretraining Data

Abstract

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.