Synthetic Bootstrapped Pretraining

Abstract

We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model from scratch on up to 1T tokens. We find that SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
