
Thinking Augmented Pre-training

September 24, 2025
Authors: Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
cs.AI

Abstract

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
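The core augmentation step described in the abstract — pairing each training document with an automatically generated thinking trajectory so that hard-to-learn tokens come with step-by-step reasoning — can be sketched as below. This is a minimal illustration, not the paper's actual pipeline: the function names, the `<think>` delimiter, and the `generate_thinking` stand-in (which in practice would be a call to a strong LLM) are all assumptions for illustration.

```python
def augment_with_thinking(document, generate_thinking):
    """Pair a raw training document with a thinking trajectory (TPT-style sketch).

    The trajectory is appended after the original text, so the model trains on
    the document followed by a step-by-step rationale that decomposes it.
    The <think> delimiter is an illustrative choice, not the paper's format.
    """
    trajectory = generate_thinking(document)
    return document + "\n<think>\n" + trajectory + "\n</think>"


def build_training_corpus(documents, generate_thinking):
    """Expand a corpus by augmenting every document with its trajectory."""
    return [augment_with_thinking(doc, generate_thinking) for doc in documents]


if __name__ == "__main__":
    # Stand-in for an LLM call; a real pipeline would prompt a capable model
    # to produce a genuine step-by-step rationale for the document.
    def toy_thinker(doc):
        return f"Step 1: restate the claim. Step 2: justify it. ({len(doc)} chars)"

    corpus = ["The derivative of x^2 is 2x."]
    print(build_training_corpus(corpus, toy_thinker)[0])
```

Because the trajectory text is itself additional training data, this kind of augmentation increases the effective token count of the corpus, which is consistent with the abstract's claim of improved data efficiency.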