Understanding Training-Time Augmentation for Data-Constrained Language Model Pretraining

Abstract

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime.\footnote{All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.}
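
To make the three augmentation categories concrete, the following is a minimal sketch of how each could act on a list of token ids. It is an illustration under stated assumptions, not the implementation in the released code: the function names, the noise rate parameter, the sentinel ids (`mask_id`, `pre_id`, `suf_id`, `mid_id`), and the prefix-suffix-middle ordering for Fill-in-the-Middle are all hypothetical choices. How the noised sequence is paired with training targets is also not specified here.

```python
import random


def replace_tokens(tokens, vocab_size, rate, rng):
    """Token-level noise: with probability `rate`, swap a token for a
    uniformly random id in [0, vocab_size). (Assumed noise model.)"""
    return [rng.randrange(vocab_size) if rng.random() < rate else t
            for t in tokens]


def mask_tokens(tokens, mask_id, rate, rng):
    """Token-level noise: with probability `rate`, replace a token with
    a hypothetical mask id."""
    return [mask_id if rng.random() < rate else t for t in tokens]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so that next-token
    prediction runs right to left."""
    return tokens[::-1]


def fill_in_the_middle(tokens, pre_id, suf_id, mid_id, rng):
    """Sequence permutation: cut the sequence at two random points and
    emit it in prefix-suffix-middle order, separated by hypothetical
    sentinel ids. Requires at least one token."""
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle


def offset_pairs(tokens, offset):
    """Target offset prediction: pair the token at position t with the
    target x_{t+offset}. offset=1 is standard next-token prediction."""
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return list(zip(tokens[:-offset], tokens[offset:]))
```

For example, `offset_pairs([10, 11, 12, 13], 2)` pairs each token with the one two positions ahead, giving `[(10, 12), (11, 13)]`. Because each function maps one token list to another, augmentations from different categories can be composed, which mirrors the combination of categories described in the abstract.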