

Synthetic Bootstrapped Pretraining

September 17, 2025
Authors: Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang
cs.AI

Abstract

We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents in the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
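To make the three-step procedure described above concrete, here is a minimal sketch of an SBP-style data pipeline: mine pairs of related documents from the corpus, use a synthesizer conditioned on a seed document to generate new documents, and mix the synthetic corpus with the original one for joint training. The helper names (`embed`, `pair_related_documents`, `synthesize_corpus`), the nearest-neighbour retrieval, the toy embedding, and the mixing ratio are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an SBP-style data pipeline (not the paper's code).
from typing import Callable, List, Tuple
import numpy as np


def embed(doc: str, dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedding; a stand-in for a real document encoder."""
    v = np.zeros(dim)
    for tok in doc.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def pair_related_documents(corpus: List[str], k: int = 1) -> List[Tuple[str, str]]:
    """Step 1: mine (seed, target) pairs of related documents via nearest neighbours."""
    E = np.stack([embed(d) for d in corpus])
    sims = E @ E.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-pairs
    pairs = []
    for i, doc in enumerate(corpus):
        for j in np.argsort(-sims[i])[:k]:
            pairs.append((doc, corpus[j]))
    return pairs


def synthesize_corpus(pairs: List[Tuple[str, str]],
                      synthesizer: Callable[[str], str],
                      n_samples: int) -> List[str]:
    """Step 2: a synthesizer trained to model p(related doc | seed doc) generates new documents."""
    seeds = [seed for seed, _ in pairs]
    return [synthesizer(seeds[i % len(seeds)]) for i in range(n_samples)]


def sbp_pretraining_data(corpus: List[str],
                         synthesizer: Callable[[str], str],
                         synthetic_ratio: float = 0.5) -> List[str]:
    """Step 3: joint training mixes real and synthetic documents (ratio assumed here)."""
    pairs = pair_related_documents(corpus)
    n_syn = int(len(corpus) * synthetic_ratio)
    return corpus + synthesize_corpus(pairs, synthesizer, n_syn)


if __name__ == "__main__":
    toy_corpus = [
        "transformers model long range dependencies in text",
        "attention layers let transformers relate distant tokens",
        "solar panels convert sunlight into electricity",
    ]
    # Placeholder synthesizer: echoes the seed. In SBP proper this would be an LM
    # fine-tuned on the mined pairs to generate a new related document from a seed.
    data = sbp_pretraining_data(toy_corpus, synthesizer=lambda seed: f"[synthetic] {seed}")
    print(len(data), "documents for joint pretraining")
```

In the paper's setting the synthesizer is itself a language model trained on the mined document pairs, so the quality of the synthetic corpus hinges on how well the pairing step captures genuine inter-document relations rather than surface similarity.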