Reinforcement Pre-Training

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT), a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained with RL, in which the model receives a verifiable reward for correctly predicting the next token of a given context. RPT offers a scalable way to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing next-token reasoning, RPT significantly improves language modeling accuracy on next-token prediction. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that next-token prediction accuracy improves consistently as training compute increases. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.
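
The verifiable reward described in the abstract can be illustrated with a minimal sketch. It assumes an exact-match reward of 1.0 when a sampled prediction equals the token that actually follows the context in the corpus, and 0.0 otherwise. The abstract does not specify the exact reward rule, so the function names (`next_token_reward`, `rollout_rewards`) and the exact-match criterion are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def next_token_reward(predicted: str, target: str) -> float:
    """Assumed reward rule: 1.0 for an exact match with the corpus token, else 0.0."""
    return 1.0 if predicted == target else 0.0


def rollout_rewards(predictions: Sequence[str], target: str) -> List[float]:
    """Score several sampled predictions made for the same context."""
    return [next_token_reward(p, target) for p in predictions]


if __name__ == "__main__":
    # Hypothetical context: "The capital of France is"; the corpus continues with " Paris".
    target = " Paris"
    # Final answers from four sampled reasoning rollouts (made up for illustration).
    sampled = [" Paris", " Lyon", " Paris", " the"]
    print(rollout_rewards(sampled, target))
```

Because the target token comes directly from the text, the reward can be checked without human annotation, which is what allows ordinary text corpora to serve as RL training data in this framing.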