Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
April 30, 2026
Authors: Ansar Aynetdinov, Patrick Haller, Alan Akbik
cs.AI
Abstract
Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
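
To make the quality-versus-diversity trade-off concrete, the sketch below shows how a fixed token budget turns a strictly filtered subset into a multi-epoch training regime. It is a minimal illustration only: the quality scores, thresholds, document counts, and token budget are hypothetical placeholders, not the paper's actual hierarchical filters or corpus statistics.

    # Illustrative sketch (not the paper's method): tiered quality filtering
    # and the epoch count implied by a fixed training budget. All numbers
    # and quality scores below are hypothetical placeholders.
    import random
    from dataclasses import dataclass

    @dataclass
    class Document:
        text: str
        quality: float   # assumed score in [0, 1] from an upstream quality model
        n_tokens: int

    def filter_by_quality(docs, threshold):
        """Keep documents at or above a quality threshold; stricter
        thresholds yield smaller, higher-quality subsets."""
        return [d for d in docs if d.quality >= threshold]

    def epochs_for_budget(token_budget, corpus):
        """How many passes over a corpus a fixed token budget implies."""
        return token_budget / sum(d.n_tokens for d in corpus)

    random.seed(0)
    # Toy corpus standing in for a large web crawl (500M documents in the paper).
    corpus = [Document(f"doc {i}", random.random(), 1_000) for i in range(10_000)]
    budget = 10_000_000  # fixed training budget in tokens

    for threshold in (0.0, 0.5, 0.9):  # lightly vs. strictly filtered tiers
        subset = filter_by_quality(corpus, threshold)
        print(f"threshold={threshold:.1f}: {len(subset):5d} docs, "
              f"{epochs_for_budget(budget, subset):.1f} epochs")

Under this toy setup, the loosest tier is consumed in a single pass, while the strictest tier is revisited roughly ten times within the same budget, which is the multi-epoch regime the abstract compares against single-pass training.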