BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and up to 2.7x faster training than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there is no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
