KletterMix: Climbing Toward High-Quality German Pretraining Data

Abstract

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, poorly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses covering translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve high quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.