
Reinforcement Pre-Training

June 9, 2025
Authors: Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
cs.AI

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained with RL, where the model receives verifiable rewards for correctly predicting the next token of a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the accuracy of next-token prediction. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. Scaling curves show that increased training compute consistently improves next-token prediction accuracy. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.
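
As a rough illustration of the training signal described in the abstract, the sketch below shows how a verifiable reward can be computed directly from unlabeled text: the model's predicted next token is compared against the actual next token in the corpus, so no domain-specific annotation is needed. This is a minimal sketch based only on the abstract, not the paper's actual implementation; all function and variable names are illustrative assumptions.

    # Minimal sketch (assumption, not the paper's code) of a verifiable reward
    # for next-token reasoning: reward 1.0 if the predicted token matches the
    # ground-truth continuation taken from the text corpus, else 0.0.

    def next_token_reward(predicted_token: str, ground_truth_token: str) -> float:
        """Return 1.0 if the prediction matches the corpus next token, else 0.0."""
        return 1.0 if predicted_token == ground_truth_token else 0.0

    def rollout_rewards(predictions, corpus_next_tokens):
        """Score a batch of next-token predictions against the corpus.

        The "answer" (the actual next token) comes for free from the text
        itself, which is what makes the reward verifiable at scale.
        """
        return [
            next_token_reward(pred, truth)
            for pred, truth in zip(predictions, corpus_next_tokens)
        ]

    # Example: the model reasons over the context, then commits to a token.
    context = "The capital of France is"
    prediction = " Paris"      # model's final answer after its reasoning trace
    ground_truth = " Paris"    # next token taken directly from the corpus
    print(next_token_reward(prediction, ground_truth))  # 1.0

In an RL setup along these lines, such per-token rewards would drive a policy-gradient update of the language model, which is how next-token prediction becomes a reasoning task trained with RL rather than pure maximum-likelihood pre-training.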