

Reasoning to Learn from Latent Thoughts

March 24, 2025
Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
cs.AI

Abstract

Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and posits that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
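To make the EM-style bootstrapping loop described above concrete, here is a minimal Python sketch of its structure, not the authors' released code. All names (`ThoughtAugmentedDoc`, `sample_thoughts`, `score`, `finetune`, the best-of-k selection) are hypothetical stand-ins; the sketch only mirrors the loop the abstract describes: an E-step that uses the current LM (with extra inference compute via more samples) to infer latent thoughts for raw documents, and an M-step that continues pretraining on the thought-augmented corpus.

```python
# Hypothetical sketch of the EM bootstrapping loop from the abstract.
# The helper callables are assumptions standing in for model-specific code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ThoughtAugmentedDoc:
    thought: str  # inferred latent reasoning / context behind the document
    text: str     # the original raw document


def em_bootstrap(
    raw_docs: List[str],
    sample_thoughts: Callable[[str, int], List[str]],  # doc -> k candidate thoughts from current LM
    score: Callable[[str, str], float],                # e.g. log p(doc | thought) under current LM
    finetune: Callable[[List[ThoughtAugmentedDoc]], None],  # continue pretraining on (thought, doc) pairs
    num_iterations: int = 3,                           # abstract reports gains over at least 3 iterations
    k: int = 8,                                        # E-step inference-compute knob (more samples = more compute)
) -> None:
    for _ in range(num_iterations):
        # E-step: for each raw document, draw k candidate latent thoughts
        # and keep the one that best explains the document.
        augmented = []
        for doc in raw_docs:
            candidates = sample_thoughts(doc, k)
            best = max(candidates, key=lambda z: score(z, doc))
            augmented.append(ThoughtAugmentedDoc(thought=best, text=doc))

        # M-step: train the LM on thought-augmented data, i.e. sequences
        # that prepend the inferred thought to the raw document.
        finetune(augmented)
```

Increasing `k` corresponds to spending more inference compute in the E-step, which the abstract reports yields larger gains; each M-step improves the model that generates the next round's thoughts, which is what allows a 1B LM to bootstrap without a stronger teacher.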

