Reasoning to Learn from Latent Thoughts
March 24, 2025
Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
cs.AI
Abstract
Compute scaling for language model (LM) pretraining has outpaced the growth
of human-written texts, leading to concerns that data will become the
bottleneck to LM scaling. To continue scaling pretraining in this
data-constrained regime, we propose that explicitly modeling and inferring the
latent thoughts that underlie the text generation process can significantly
improve pretraining data efficiency. Intuitively, our approach views web text
as the compressed final outcome of a verbose human thought process, where the
latent thoughts contain important contextual knowledge and reasoning steps that
are critical to data-efficient learning. We empirically demonstrate the
effectiveness of our approach through data-constrained continued pretraining
for math. We first show that synthetic data approaches to inferring latent
thoughts significantly improve data efficiency, outperforming training on the
same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we
demonstrate latent thought inference without a strong teacher, where an LM
bootstraps its own performance by using an EM algorithm to iteratively improve
the capability of the trained LM and the quality of thought-augmented
pretraining data. We show that a 1B LM can bootstrap its performance across at
least three iterations and significantly outperform baselines trained on raw
data, with increasing gains from additional inference compute when performing
the E-step. The gains from inference scaling and EM iterations suggest new
opportunities for scaling data-constrained pretraining.
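
To make the EM-style bootstrapping loop concrete, here is a minimal Python sketch under stated assumptions, not the paper's released implementation: the `generate`, `score`, and `train` callables, the thought-eliciting prompt, and the `samples_per_doc` knob are hypothetical placeholders. The E-step uses the current LM to sample candidate latent thoughts for each raw document and keeps the highest-scoring one (more samples corresponds to more E-step inference compute); the M-step continues pretraining on the thought-augmented corpus, which in turn improves the thoughts inferred in the next iteration.

```python
from typing import Callable, List

def em_bootstrap(
    raw_docs: List[str],
    generate: Callable[[str, str], str],  # (prompt, doc) -> sampled latent thought (assumed interface)
    score: Callable[[str, str], float],   # (doc, thought) -> plausibility score (assumed interface)
    train: Callable[[List[str]], None],   # continue pretraining on augmented docs (assumed interface)
    num_iterations: int = 3,
    samples_per_doc: int = 4,             # E-step inference compute knob (hypothetical)
) -> None:
    """Sketch of EM-style bootstrapping: alternate between inferring latent
    thoughts for raw documents (E-step) and training on the thought-augmented
    corpus (M-step)."""
    prompt = "Write out the reasoning and background knowledge behind this text:"
    for _ in range(num_iterations):
        # E-step: sample several candidate thoughts per document with the
        # current model and keep the best-scoring one.
        augmented = []
        for doc in raw_docs:
            candidates = [generate(prompt, doc) for _ in range(samples_per_doc)]
            best = max(candidates, key=lambda t: score(doc, t))
            augmented.append(best + "\n" + doc)  # prepend inferred thought to the raw text
        # M-step: continue pretraining the LM on the thought-augmented data.
        train(augmented)
```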