Reasoning to Learn from Latent Thoughts

Abstract

Compute scaling for language model (LM) pretraining has outpaced the growth of human-written text, raising concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and holds that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of the thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
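
The EM-style bootstrapping described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `em_bootstrap`, `propose`, `score`, `train`, and `toy_run` are hypothetical, keeping the single best-scoring candidate in the E-step is a simplifying assumption, and a toy numeric "model" stands in for an LM so that the loop can run on its own.

```python
from itertools import cycle


def em_bootstrap(model, docs, propose, score, train,
                 num_iterations=3, num_samples=4):
    """EM-style bootstrapping sketch; all callables are hypothetical stand-ins.

    propose(model, doc)        -> one candidate latent thought for doc
    score(model, thought, doc) -> number, higher means a better thought
    train(model, pairs)        -> updated model from (thought, doc) pairs
    """
    for _ in range(num_iterations):
        # E-step: draw several candidate thoughts per document with the
        # current model and keep the best-scoring one (an assumption made
        # here for simplicity).
        pairs = []
        for doc in docs:
            candidates = [propose(model, doc) for _ in range(num_samples)]
            best = max(candidates, key=lambda z: score(model, z, doc))
            pairs.append((best, doc))
        # M-step: update the model on the thought-augmented data.
        model = train(model, pairs)
    return model


def toy_run(num_samples):
    """Toy stand-in: the 'model' is a float and the 'documents' are numbers."""
    # Deterministic offsets play the role of sampling noise.
    offsets = cycle([-1.0, 0.5, 1.0, -0.5])

    def propose(model, doc):
        return model + next(offsets)

    def score(model, thought, doc):
        return -abs(thought - doc)

    def train(model, pairs):
        return sum(thought for thought, _ in pairs) / len(pairs)

    return em_bootstrap(0.0, [4.0, 6.0], propose, score, train,
                        num_iterations=3, num_samples=num_samples)
```

In this toy setting, `toy_run(4)` returns 3.0 and `toy_run(1)` returns -0.25, so the run with more E-step candidates ends closer to the data mean of 5.0. That only loosely mirrors the abstract's point about additional inference compute in the E-step and should not be read as evidence for it.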
