OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
February 5, 2026
Authors: Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
cs.AI
Abstract
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-grade baselines and even full 200B-token training. Moreover, when combined with industrial-grade static filters, OPUS further improves pre-training efficiency, even on lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data-efficiency gains in specialized domains.
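To make the score-and-sample idea described in the abstract concrete, below is a minimal Python sketch, assuming per-candidate optimizer-induced update vectors and a proxy-derived target direction are already available (e.g., in a sketched low-dimensional space). The names projected_utility and boltzmann_select are illustrative only and do not reflect the authors' implementation.

```python
import numpy as np

def projected_utility(candidate_updates: np.ndarray, target_direction: np.ndarray) -> np.ndarray:
    """Score each candidate (one row per candidate) by projecting its
    optimizer-induced update onto the unit target direction."""
    target_unit = target_direction / (np.linalg.norm(target_direction) + 1e-12)
    return candidate_updates @ target_unit

def boltzmann_select(scores: np.ndarray, k: int, temperature: float = 1.0, rng=None) -> np.ndarray:
    """Sample k candidates without replacement with probability proportional
    to exp(score / temperature); the temperature trades utility for diversity."""
    rng = rng or np.random.default_rng()
    logits = scores / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy usage: 1,000 candidates with 4,096-dimensional (sketched) update vectors.
rng = np.random.default_rng(0)
updates = rng.standard_normal((1000, 4096))
target = rng.standard_normal(4096)
chosen = boltzmann_select(projected_utility(updates, target), k=100, temperature=0.5, rng=rng)
```

Lowering the temperature concentrates selection on the highest-scoring candidates, while raising it spreads probability mass more evenly, which is one way to realize the diversity role the abstract attributes to Boltzmann sampling.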