RLP: Reinforcement as a Pretraining Objective

Abstract

The dominant paradigm for training large reasoning models starts with pretraining using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. Is this dominant recipe an optimal way to train? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning, exploration, to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed from the information gain it provides for predicting future tokens. This training objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier, during pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This approach yields a verifier-free, dense reward signal, allowing efficient training on the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
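The reward described in the abstract can be illustrated with a small sketch. It computes log p(next token | context, reasoning chain) minus log p(next token | context) from two sets of next-token logits. The function names, the toy three-token vocabulary, and the use of raw logit lists are assumptions made for illustration; the abstract does not specify how the two predictions are produced, so this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def information_gain_reward(
    logits_with_thought: Sequence[float],
    logits_context_only: Sequence[float],
    next_token_id: int,
) -> float:
    """Hypothetical helper: log p(next | context, thought) - log p(next | context).

    A positive value means the sampled reasoning chain made the observed
    next token more likely; a negative value means it made it less likely.
    """
    lp_with = log_softmax(logits_with_thought)[next_token_id]
    lp_without = log_softmax(logits_context_only)[next_token_id]
    return lp_with - lp_without


# Toy example (made-up logits over a 3-token vocabulary, observed token id 0).
context_only = [0.0, 0.0, 0.0]   # uniform prediction without a thought
with_thought = [2.0, 0.0, 0.0]   # the thought shifts mass toward token 0
reward = information_gain_reward(with_thought, context_only, next_token_id=0)
print(f"information-gain reward: {reward:.3f}")
```

Because the reward is defined for every observed next token and needs no external checker, a signal of this kind is dense and verifier-free, which matches the property the abstract highlights.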
