

Reinforcement Pre-Training

June 9, 2025
Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
cs.AI

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.
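
The key mechanism described above is a verifiable reward: the model produces a reasoning trace, and the reward is simply whether its final prediction matches the actual next token from the pre-training corpus. A minimal sketch of such a reward check is shown below, assuming the policy ends its trace with an explicit prediction; the `Prediction:` marker and the function names `extract_prediction` and `rpt_reward` are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a verifiable next-token reward in the spirit of RPT.
# Assumption: the model's output is a reasoning trace that ends with an
# explicit "Prediction:" marker followed by its next-token guess.

def extract_prediction(model_output: str) -> str:
    """Pull the final next-token prediction out of the reasoning trace."""
    marker = "Prediction:"
    if marker not in model_output:
        return ""
    return model_output.rsplit(marker, 1)[-1].strip()

def rpt_reward(model_output: str, ground_truth_token: str) -> float:
    """Return 1.0 if the predicted token matches the corpus's actual next
    token, else 0.0. The ground truth comes directly from the pre-training
    text, so no domain-specific annotation is needed."""
    return float(extract_prediction(model_output) == ground_truth_token)

# Usage: reward a trace whose final prediction matches the true next token.
trace = "The context suggests a verb follows the subject. Prediction: runs"
print(rpt_reward(trace, "runs"))  # 1.0
```

Because the reward is a binary match against text that already exists, any corpus position can serve as an RL training example, which is what makes the approach scalable to ordinary pre-training data.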