SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
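The abstract gives no formulas, so the following is only a minimal sketch of what a sequence-level update with a scalar baseline could look like, not the authors' implementation. It assumes that the whole response is treated as a single bandit action, that the advantage is the verifiable outcome reward minus one scalar value estimate for the prompt, and that the PPO importance ratio is taken over the full sequence. The function names, the ratio aggregation, and the clipping constant are all hypothetical.

```python
import math
from typing import Sequence


def sequence_advantage(reward: float, value_estimate: float) -> float:
    """One advantage per response: outcome reward minus a scalar baseline.

    Assumption: value_estimate is a single number predicted for the prompt,
    so no group of sampled responses is needed to form the baseline.
    """
    return reward - value_estimate


def sequence_ppo_loss(
    new_logps: Sequence[float],
    old_logps: Sequence[float],
    advantage: float,
    clip_eps: float = 0.2,  # hypothetical clipping constant
) -> float:
    """Clipped PPO surrogate loss for one response treated as one action.

    Assumption: the importance ratio is the ratio of whole-sequence
    probabilities, i.e. exp(sum of token log-prob differences).
    """
    log_ratio = sum(new_logps) - sum(old_logps)
    ratio = math.exp(log_ratio)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Standard PPO pessimistic bound, negated because we minimize the loss.
    return -min(ratio * advantage, clipped_ratio * advantage)


# Usage: a correct answer (reward 1.0) for a prompt the critic scores at 0.4.
adv = sequence_advantage(reward=1.0, value_estimate=0.4)
loss = sequence_ppo_loss([-1.0, -2.0], [-1.0, -2.0], adv)
```

Under these assumptions, every token in the response shares one advantage signal, which sidesteps per-token credit assignment, and the baseline comes from a single scalar prediction instead of the mean reward over several sampled responses.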
