ESPO: Early-Stopping Proximal Policy Optimization

Abstract

When a large language model trained with reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon. This spends compute on tokens that never receive positive reward and pollutes advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and it stops generation once the smoothed cumulative regret significantly exceeds its estimated value. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while reducing cumulative rollout tokens by more than 20%.
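To make the stopping rule and the treatment of truncated trajectories concrete, the following is a minimal sketch under stated assumptions, not the method as specified in this work. The abstract does not define the surrogate regret, the smoothing, or the threshold, so the sketch assumes a probability gap between the greedy token and the sampled token, an exponential moving average of the cumulative regret, and a caller-supplied per-step `baseline` with a fixed `margin`. The names `surrogate_regret`, `stop_step`, `truncated_td_errors`, `alpha`, `margin`, and `failure_reward` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate_regret(logits, sampled_id):
    """Assumed proxy: probability gap between the greedy and the sampled token.

    It reuses the sampling logits only, so no extra forward pass is needed.
    """
    probs = softmax(logits)
    return max(probs) - probs[sampled_id]


def stop_step(step_logits, sampled_ids, baseline, alpha=0.5, margin=0.5):
    """Return the first step where the smoothed cumulative regret exceeds
    baseline[t] + margin, or None if the rollout is never stopped.

    The EMA smoothing and the additive margin are assumptions of this sketch.
    """
    cumulative = 0.0
    smoothed = 0.0
    for t, (logits, token) in enumerate(zip(step_logits, sampled_ids)):
        cumulative += surrogate_regret(logits, token)
        smoothed = alpha * cumulative + (1.0 - alpha) * smoothed
        if smoothed > baseline[t] + margin:
            return t
    return None


def truncated_td_errors(values, failure_reward, gamma=1.0):
    """One-step TD errors for a rollout truncated at its last step.

    The truncation step transitions into an absorbing failure state whose
    value is 0 and which pays `failure_reward` (a hypothetical parameter);
    all earlier steps are assumed to receive zero reward.
    """
    deltas = []
    last = len(values) - 1
    for t, value in enumerate(values):
        if t == last:
            reward, next_value = failure_reward, 0.0
        else:
            reward, next_value = 0.0, values[t + 1]
        deltas.append(reward + gamma * next_value - value)
    return deltas


if __name__ == "__main__":
    logits = [[2.0, 0.0]] * 4
    # Two greedy steps followed by two low-probability steps.
    print(stop_step(logits, [0, 0, 1, 1], baseline=[0.0] * 4))
    # A critic that was optimistic until truncation.
    print(truncated_td_errors([0.5, 0.5, 0.5], failure_reward=0.0))
```

In this toy setting the critic's value estimates are flat, so the only nonzero TD error falls on the truncation step, which illustrates how treating truncation as an absorbing failure state can localize the negative signal near the detected failure.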