Synthetic Bootstrapped Pretraining

Abstract

We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents in the pretraining dataset and then uses that model to synthesize a vast new corpus for joint training. Standard pretraining teaches LMs to learn causal correlations among tokens within a single document, but it is not designed to efficiently model the rich, learnable inter-document correlations that could lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model from scratch on up to 1T tokens. We find that SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
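The abstract does not say how related documents are identified or how the synthesizer is trained. As a minimal sketch of the first step only, assuming related documents are paired by a similarity threshold, the (seed, target) pairs on which a conditional synthesizer could be trained might be built as follows. The names `jaccard` and `build_pairs`, the token-overlap similarity, and the threshold value are all hypothetical stand-ins for illustration, not the method used in the work.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap; a toy stand-in (assumption) for a real similarity model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_pairs(docs: list[str], threshold: float = 0.3) -> list[tuple[str, str]]:
    """Return (seed, target) pairs for documents whose similarity meets the threshold.

    Both orderings are emitted so either document can act as the seed.
    The threshold is an illustrative assumption, not a value from the source.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            pairs.append((docs[i], docs[j]))
            pairs.append((docs[j], docs[i]))
    return pairs


corpus = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
pairs = build_pairs(corpus)
```

Under this reading, a synthesizer would then be trained to generate the target given the seed, and its samples would be mixed with the original corpus for joint training. Those steps require an actual LM and are omitted here.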

