ESPO: Early-Stopping Proximal Policy Optimization

Abstract

When a large language model trained with reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon. This spends compute on tokens that never receive positive reward and pollutes advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates the rollout when the smoothed cumulative regret significantly exceeds its estimated value. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while reducing cumulative rollout tokens by more than 20%.
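
The stopping rule described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the surrogate regret is taken to be the log-probability gap between the most likely token and the sampled token, the smoothing is an exponential moving average, and the "estimated value" is a running mean of the smoothed regret plus a multiple of its standard deviation. The names `surrogate_regret`, `EarlyStopMonitor`, `first_stop_step`, and the parameters `alpha`, `k`, `warmup`, and `margin` are hypothetical and are not taken from ESPO.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_id):
    """ASSUMED proxy: log-prob gap between the greedy token and the sampled one.

    It is non-negative and uses only the logits already produced for sampling.
    """
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


class EarlyStopMonitor:
    """ASSUMED detector: EMA-smoothed regret compared against a running baseline."""

    def __init__(self, alpha=0.2, k=3.0, warmup=5, margin=0.05):
        self.alpha = alpha      # EMA smoothing factor (hypothetical)
        self.k = k              # number of std deviations tolerated (hypothetical)
        self.warmup = warmup    # steps observed before stopping is allowed
        self.margin = margin    # absolute slack so a zero-variance baseline is not brittle
        self.smoothed = None
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0           # sum of squared deviations (Welford's algorithm)

    def update(self, regret):
        """Feed one per-step regret; return True if the rollout should stop."""
        if self.smoothed is None:
            self.smoothed = regret
        else:
            self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * regret

        # Compare against the baseline built from earlier steps only.
        stop = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            stop = self.smoothed > self.mean + self.k * std + self.margin

        # Welford update of the running mean and variance of the smoothed regret.
        self.n += 1
        delta = self.smoothed - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (self.smoothed - self.mean)
        return stop


def first_stop_step(step_logits, sampled_ids, monitor):
    """Return the index of the first step flagged for termination, or None."""
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_ids)):
        if monitor.update(surrogate_regret(logits, tok)):
            return t
    return None
```

In a training loop, a rollout flagged this way would be truncated at the returned step and, following the abstract, treated as an absorbing failure state with a terminal reward. How that reward is set and how the threshold is estimated in ESPO itself is not specified here, so this sketch should not be read as the authors' implementation.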