

Synthetic Bootstrapped Pretraining

September 17, 2025
Authors: Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang
cs.AI

Abstract

We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents in the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model from scratch on up to 1T tokens. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
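To make the three-stage procedure in the abstract concrete, below is a minimal, self-contained sketch of an SBP-style pipeline. It is an illustration under stated assumptions, not the paper's implementation: the inter-document relation model is replaced by a toy bag-of-words cosine similarity, the synthesizer by a placeholder rewriter standing in for an LM fine-tuned on related-document pairs, and the joint pretraining step by a stub that simply concatenates real and synthetic documents. All function names (`related_pairs`, `train_synthesizer`, `sbp`) are hypothetical.

```python
# Sketch of an SBP-style pipeline: (1) model inter-document relations,
# (2) train a synthesizer on related pairs, (3) synthesize a corpus and
# join it with the real data for pretraining. Placeholders throughout.
from collections import Counter
import math
import random


def related_pairs(corpus, threshold=0.2):
    """Step 1 (assumption): pair documents by bag-of-words cosine similarity
    as a stand-in for the learned inter-document relation model."""
    vecs = [Counter(doc.lower().split()) for doc in corpus]

    def cosine(a, b):
        shared = set(a) & set(b)
        num = sum(a[w] * b[w] for w in shared)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    pairs = []
    for i in range(len(corpus)):
        for j in range(i + 1, len(corpus)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                pairs.append((corpus[i], corpus[j]))
    return pairs


def train_synthesizer(pairs):
    """Step 2 (assumption): fit a model of p(related doc | seed doc).
    Here a toy rewriter stands in for an LM fine-tuned on the pairs."""
    def synthesize(seed):
        words = seed.split()
        random.shuffle(words)
        return " ".join(words)  # placeholder for a genuinely new narration
    return synthesize


def sbp(corpus, num_synthetic=4):
    """Step 3: synthesize new documents from seeds, then return the joint
    corpus that would be fed to pretraining (training loop omitted)."""
    pairs = related_pairs(corpus)
    synthesize = train_synthesizer(pairs)
    seeds = random.choices(corpus, k=num_synthetic)
    synthetic_corpus = [synthesize(seed) for seed in seeds]
    return corpus + synthetic_corpus


if __name__ == "__main__":
    docs = [
        "neural scaling laws for language models",
        "scaling laws predict language model loss",
        "protein folding with deep networks",
    ]
    print(sbp(docs))
```

The key design point the sketch preserves is that the synthesizer is conditioned on pairs of related documents rather than on single documents, which is what lets it abstract shared latent concepts instead of producing paraphrases of a single seed.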