
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

February 5, 2026
作者: Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
cs.AI

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
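
The abstract names the ingredients only at a high level. As a rough illustration, not the authors' implementation, the minimal NumPy sketch below shows how a projected-utility score of this kind might be computed: each candidate's gradient is mapped to an optimizer-shaped update (here a hypothetical single Adam-style step), CountSketch compresses both the update and a proxy target direction so their inner product is cheap to approximate, and a Boltzmann distribution over scores draws a diverse batch instead of a greedy top-k. All array sizes, the `effective_update` and `countsketch` helpers, and the shared optimizer moments are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_update(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical Adam-style preconditioned update for one candidate's gradient.
    Illustrates scoring in the optimizer-induced update space rather than on raw gradients."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m / (np.sqrt(v) + eps)

def countsketch(x, sign, bucket, dim):
    """CountSketch projection: hash coordinates into `dim` signed buckets so that
    inner products between sketched vectors approximate the full inner product."""
    out = np.zeros(dim)
    np.add.at(out, bucket, sign * x)
    return out

# --- hypothetical setup: d-dimensional gradients for n candidate sequences ---
n, d, k = 64, 4096, 256                       # candidates, parameter dim, sketch dim
grads = rng.standard_normal((n, d))           # stand-in per-candidate gradients
proxy_dir = rng.standard_normal(d)            # target direction from an in-distribution proxy
m_state = np.zeros(d)                         # stand-in shared optimizer moments
v_state = np.zeros(d)

sign = rng.choice([-1.0, 1.0], size=d)        # CountSketch hash: random signs
bucket = rng.integers(0, k, size=d)           # CountSketch hash: bucket assignment
proxy_sketch = countsketch(proxy_dir, sign, bucket, k)

# Score each candidate by projecting its optimizer-shaped update onto the proxy direction.
scores = np.array([
    countsketch(effective_update(g, m_state, v_state), sign, bucket, k) @ proxy_sketch
    for g in grads
])

# Boltzmann sampling over the scores trades off utility against diversity.
tau = 1.0
probs = np.exp((scores - scores.max()) / tau)
probs /= probs.sum()
selected = rng.choice(n, size=16, replace=False, p=probs)
print("selected candidate indices:", selected)
```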