ESPO: Early-Stopping Proximal Policy Optimization

Abstract

When a large language model trained with reinforcement learning makes a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon. This spends compute on tokens that can never receive positive reward and pollutes advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates generation when the smoothed cumulative regret significantly exceeds its estimated value. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On mathematical reasoning tasks with DeepSeek-R1-Distill-Qwen-7B, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
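
The stopping rule described above can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything concrete here is an assumption made for illustration: the surrogate regret is taken to be the log-probability gap between the most likely token and the sampled token, the smoothing is an exponential moving average, and the threshold is a fixed `baseline + margin`. The names `surrogate_regret`, `EarlyStopper`, and `rollout_stop_step` are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_id):
    """ASSUMED proxy: log-prob gap between the greedy token and the sampled one.

    It uses only the logits already produced during sampling.
    """
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


class EarlyStopper:
    """ASSUMED stopping rule: EMA-smoothed regret compared to baseline + margin."""

    def __init__(self, baseline, margin, alpha=0.1):
        self.baseline = baseline  # assumed estimate of typical smoothed regret
        self.margin = margin      # how far above the baseline counts as failure
        self.alpha = alpha        # EMA smoothing factor
        self.smoothed = 0.0

    def update(self, regret):
        """Fold in one step's regret; return True if the rollout should stop."""
        self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * regret
        return self.smoothed > self.baseline + self.margin


def rollout_stop_step(steps, stopper):
    """Return the index of the first step that triggers a stop, or None.

    `steps` is an iterable of (logits, sampled_token_id) pairs.
    """
    for t, (logits, token_id) in enumerate(steps):
        if stopper.update(surrogate_regret(logits, token_id)):
            return t
    return None
```

In a real training loop the check would run inside the generation loop, and a stopped trajectory would be assigned the terminal failure reward before advantages are computed. How the baseline is estimated and how large the margin should be are details this sketch does not try to reproduce.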