Reasoning to Learn from Latent Thoughts
March 24, 2025
Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
cs.AI
Abstract
Compute scaling for language model (LM) pretraining has outpaced the growth
of human-written texts, leading to concerns that data will become the
bottleneck to LM scaling. To continue scaling pretraining in this
data-constrained regime, we propose that explicitly modeling and inferring the
latent thoughts that underlie the text generation process can significantly
improve pretraining data efficiency. Intuitively, our approach views web text
as the compressed final outcome of a verbose human thought process, in which the
latent thoughts contain important contextual knowledge and reasoning steps that
are critical to data-efficient learning. We empirically demonstrate the
effectiveness of our approach through data-constrained continued pretraining
for math. We first show that synthetic data approaches to inferring latent
thoughts significantly improve data efficiency, outperforming training on the
same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we
demonstrate latent thought inference without a strong teacher, where an LM
bootstraps its own performance by using an EM algorithm to iteratively improve
the capability of the trained LM and the quality of thought-augmented
pretraining data. We show that a 1B LM can bootstrap its performance across at
least three iterations and significantly outperform baselines trained on raw
data, with increasing gains from additional inference compute when performing
the E-step. The gains from inference scaling and EM iterations suggest new
opportunities for scaling data-constrained pretraining.
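To make the bootstrapping procedure described in the abstract concrete, the sketch below shows one plausible reading of the EM loop: the E-step samples candidate latent thoughts for each raw document with the current model and keeps the highest-scoring one (more samples per document corresponds to more E-step inference compute), and the M-step continues pretraining on the thought-augmented corpus. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`sample_thoughts`, `score_joint`, `finetune`) are hypothetical placeholders, and the best-of-k selection rule is an assumed simplification of how the E-step might weight samples.

```python
# Minimal sketch of an EM-style bootstrapping loop for latent-thought inference.
# All callables here are hypothetical placeholders supplied by the user; the
# best-of-k E-step is an assumption, not the paper's exact procedure.

from typing import Callable, Sequence


def bootstrap_latent_thoughts(
    raw_docs: Sequence[str],
    sample_thoughts: Callable[[str, int], list[str]],  # current LM proposes k latent thoughts per doc
    score_joint: Callable[[str, str], float],          # e.g. log-likelihood of (thought, doc) under current LM
    finetune: Callable[[Sequence[str]], None],         # continued pretraining on thought-augmented docs
    num_iterations: int = 3,
    samples_per_doc: int = 4,                          # more samples per doc = more E-step inference compute
) -> None:
    """Alternate between inferring latent thoughts (E-step) and training on them (M-step)."""
    for _ in range(num_iterations):
        augmented_corpus = []
        # E-step: sample candidate latent thoughts for each raw document with the
        # current model and keep the highest-scoring candidate.
        for doc in raw_docs:
            candidates = sample_thoughts(doc, samples_per_doc)
            best = max(candidates, key=lambda z: score_joint(z, doc))
            augmented_corpus.append(best + "\n" + doc)  # thought-augmented training example
        # M-step: continue pretraining on the thought-augmented corpus, which in
        # turn improves the next iteration's thought inference.
        finetune(augmented_corpus)
```

In the setting the abstract describes, the same 1B model plays both roles (proposing thoughts and being trained), and gains are reported across at least three such iterations; the teacher-based synthetic-data variant would instead have `sample_thoughts` call a stronger external model.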