
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

February 5, 2026
作者: Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
cs.AI

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
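
The abstract names the ingredients only at a high level. As a rough illustration, not the authors' implementation, the minimal NumPy sketch below shows how a projected-utility score of this kind might be computed: each candidate's gradient is mapped to an optimizer-shaped update (here a hypothetical single Adam-style step), CountSketch compresses both the update and a proxy target direction so their inner product is cheap to approximate, and a Boltzmann distribution over scores draws a diverse batch instead of a greedy top-k. All array sizes, the `effective_update` and `countsketch` helpers, and the shared optimizer moments are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_update(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical Adam-style preconditioned update for one candidate's gradient.
    Illustrates scoring in the optimizer-induced update space rather than on raw gradients."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m / (np.sqrt(v) + eps)

def countsketch(x, sign, bucket, dim):
    """CountSketch projection: hash coordinates into `dim` signed buckets so that
    inner products between sketched vectors approximate the full inner product."""
    out = np.zeros(dim)
    np.add.at(out, bucket, sign * x)
    return out

# --- hypothetical setup: d-dimensional gradients for n candidate sequences ---
n, d, k = 64, 4096, 256                       # candidates, parameter dim, sketch dim
grads = rng.standard_normal((n, d))           # stand-in per-candidate gradients
proxy_dir = rng.standard_normal(d)            # target direction from an in-distribution proxy
m_state = np.zeros(d)                         # stand-in shared optimizer moments
v_state = np.zeros(d)

sign = rng.choice([-1.0, 1.0], size=d)        # CountSketch hash: random signs
bucket = rng.integers(0, k, size=d)           # CountSketch hash: bucket assignment
proxy_sketch = countsketch(proxy_dir, sign, bucket, k)

# Score each candidate by projecting its optimizer-shaped update onto the proxy direction.
scores = np.array([
    countsketch(effective_update(g, m_state, v_state), sign, bucket, k) @ proxy_sketch
    for g in grads
])

# Boltzmann sampling over the scores trades off utility against diversity.
tau = 1.0
probs = np.exp((scores - scores.max()) / tau)
probs /= probs.sum()
selected = rng.choice(n, size=16, replace=False, p=probs)
print("selected candidate indices:", selected)
```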