Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
April 30, 2026
Authors: Ansar Aynetdinov, Patrick Haller, Alan Akbik
cs.AI
Abstract
Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
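To make the trade-off described above concrete: under a fixed training token budget, shrinking the corpus through stricter filtering directly determines how many epochs the remaining data must be repeated. The sketch below illustrates this arithmetic; the budget, corpus sizes, and function name are assumed for exposition and are not values from the paper.

```python
# Hypothetical illustration of the quality-vs-diversity trade-off.
# All numbers and names are illustrative assumptions, not values
# reported in the paper.

def epochs_for_budget(token_budget: int, unique_tokens: int) -> float:
    """Number of passes over a corpus needed to spend a fixed token budget."""
    return token_budget / unique_tokens

TOKEN_BUDGET = 50_000_000_000  # fixed training budget (50B tokens, assumed)

# Option A: single pass over a large, lightly filtered corpus.
lightly_filtered = 50_000_000_000   # unique tokens (assumed)

# Option B: strict hierarchical filtering keeps a small high-quality core.
strictly_filtered = 7_000_000_000   # unique tokens (assumed)

print(f"Lightly filtered:  {epochs_for_budget(TOKEN_BUDGET, lightly_filtered):.1f} epochs")
print(f"Strictly filtered: {epochs_for_budget(TOKEN_BUDGET, strictly_filtered):.1f} epochs")
```

At these assumed sizes, the strictly filtered core is seen roughly seven times under the same budget, which matches the regime where the abstract reports repeated high-quality data still outperforming the single-pass diverse corpus.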