ESPO: Early-Stopping Proximal Policy Optimization

Abstract

When a large language model under reinforcement learning makes a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon. This spends compute on tokens that can never receive positive reward and pollutes advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates generation when the smoothed cumulative regret significantly exceeds its estimated value. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while reducing cumulative rollout tokens by more than 20%.
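The stopping rule described above can be illustrated with a minimal Python sketch. The abstract does not define the surrogate regret, the smoothing, or the threshold, so everything concrete here is an assumption made for illustration: the regret is taken to be the log-probability gap between the greedy token and the sampled token, the smoothing is a discounted running sum, and the names `EarlyStopMonitor`, `baseline`, `margin`, and `decay` are hypothetical. It should not be read as the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_id):
    """ASSUMED proxy: log-prob gap between the greedy and the sampled token.

    Uses only the logits already available at sampling time; always >= 0.
    """
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


class EarlyStopMonitor:
    """Hypothetical detector: stop when the smoothed cumulative regret
    exceeds its expected value by more than `margin`."""

    def __init__(self, baseline, margin, decay=0.9):
        self.baseline = baseline  # assumed typical per-step regret
        self.margin = margin      # assumed slack before declaring failure
        self.decay = decay        # assumed smoothing factor, 0 < decay < 1
        self.smoothed = 0.0
        self.steps = 0

    def update(self, logits, sampled_id):
        """Consume one generation step; return True if the rollout should stop."""
        self.smoothed = self.decay * self.smoothed + surrogate_regret(logits, sampled_id)
        self.steps += 1
        # Discounted sum that a constant per-step regret `baseline` would produce.
        expected = self.baseline * (1 - self.decay ** self.steps) / (1 - self.decay)
        return self.smoothed > expected + self.margin


def rollout_length(step_logits, sampled_ids, monitor):
    """Number of steps generated before the monitor stops the rollout."""
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_ids), start=1):
        if monitor.update(logits, tok):
            return t
    return len(sampled_ids)
```

In this toy setup, a rollout that keeps sampling the greedy token accumulates zero regret and runs to its full length, while one that switches to a low-probability token is cut at the first step where the smoothed sum crosses the threshold. In a real training loop the truncated trajectory would then be assigned the terminal failure reward mentioned in the abstract; that part is not sketched here.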