ExpRL: Exploratory Reinforcement Learning for LLM Mid-Training

Abstract

Sparse-reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage already present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, and self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage suffices for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training on large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, and an LLM judge compares each sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
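To make the contrast between sparse and rubric-based dense rewards concrete, the following is a minimal sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the names `RubricItem`, `build_rubric`, `keyword_judge`, `dense_reward`, and `sparse_reward` are hypothetical, the rubric is built naively as one item per line of the reference, and a plain substring match stands in for the LLM judge described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str   # a step or fact taken from the reference solution
    weight: float = 1.0


def build_rubric(reference_solution: str) -> List[RubricItem]:
    # Assumption: one equally weighted rubric item per non-empty reference line.
    lines = [ln.strip() for ln in reference_solution.splitlines()]
    return [RubricItem(description=ln) for ln in lines if ln]


def keyword_judge(trace: str, item: RubricItem) -> bool:
    # Stand-in for an LLM judge: case-insensitive substring match.
    return item.description.lower() in trace.lower()


def dense_reward(
    trace: str,
    rubric: List[RubricItem],
    judge: Callable[[str, RubricItem], bool] = keyword_judge,
) -> float:
    # Fraction of rubric weight the sampled trace earns, in [0, 1].
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if judge(trace, item))
    return earned / total


def sparse_reward(trace: str, final_answer: str) -> float:
    # Final-answer-only reward: 1 if the answer appears in the trace, else 0.
    return 1.0 if final_answer.lower() in trace.lower() else 0.0


# The reference is used only to build the rubric; the policy never sees it.
reference = "(x - 2)(x - 3) = 0\nx = 2 or x = 3"
rubric = build_rubric(reference)

partial_trace = "Factoring gives (x - 2)(x - 3) = 0, so x = 2."
full_trace = "Factoring gives (x - 2)(x - 3) = 0, so x = 2 or x = 3."

print(dense_reward(partial_trace, rubric))               # 0.5
print(sparse_reward(partial_trace, "x = 2 or x = 3"))    # 0.0
print(dense_reward(full_trace, rubric))                  # 1.0
```

In this toy case the partially correct trace earns half credit under the rubric but nothing under the final-answer check, which is the kind of partial progress the abstract says sparse rewards fail to upweight. A real system would replace `keyword_judge` with a call to an LLM judge and would likely use a richer rubric than line splitting.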