BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
August 14, 2025
作者: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
cs.AI
Abstract
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
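
The abstract describes BeyondWeb as a framework that rephrases selected web data into higher-quality synthetic pretraining text, and notes that deciding which data to rephrase, and how, is one of the factors that matters. As a rough illustration of that general idea only, not the authors' actual pipeline, the sketch below shows a minimal rephrasing loop. The names `rephrase_corpus`, `rephrase_fn`, and `REPHRASE_PROMPT`, as well as the length thresholds, are hypothetical choices, and the LLM call is left as a user-supplied callable so no specific model API is assumed.

```python
# Hypothetical sketch of source-rephrasing for synthetic pretraining data.
# Nothing here is taken from the BeyondWeb paper; it only illustrates the
# general "rephrase selected web documents with an LLM" idea in the abstract.

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    doc_id: str
    text: str


# One plausible prompt for a rephraser model: rewrite noisy web text into
# clearer prose while keeping its factual content intact.
REPHRASE_PROMPT = (
    "Rewrite the following web text as clear, self-contained prose. "
    "Preserve all facts, names, and numbers; remove boilerplate.\n\n{text}"
)


def rephrase_corpus(
    docs: Iterable[Document],
    rephrase_fn: Callable[[str], str],  # wrapper around any LLM of your choice
    min_chars: int = 200,               # skip fragments too short to be useful
    max_chars: int = 8000,              # truncate very long pages before prompting
) -> Iterator[Document]:
    """Yield synthetic documents produced by rephrasing selected web documents."""
    for doc in docs:
        text = doc.text.strip()
        if len(text) < min_chars:
            continue  # crude filter standing in for "which data to rephrase"
        prompt = REPHRASE_PROMPT.format(text=text[:max_chars])
        synthetic = rephrase_fn(prompt)
        yield Document(doc_id=f"{doc.doc_id}-synth", text=synthetic)


if __name__ == "__main__":
    # Toy stand-in for a real LLM call, just to show the pipeline shape.
    def echo_llm(prompt: str) -> str:
        return prompt.split("\n\n", 1)[-1].upper()

    corpus = [Document("web-0001", "Example web page text. " * 20)]
    for synth in rephrase_corpus(corpus, echo_llm):
        print(synth.doc_id, synth.text[:60])
```

Keeping the generator behind a plain callable also makes it straightforward to swap rephraser models of different sizes and families, which the abstract highlights as one of the factors influencing synthetic data quality.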