MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

Abstract

Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from an existing corpus. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity for next-generation large-scale synthetic pretraining language models. (3) Through comprehensive analysis, we investigate the impact of prompt engineering on synthetic training collapse and reveal limitations in conventional collapse detection metrics that use validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.
