

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

August 14, 2025
作者: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
cs.AI

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.