SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
April 10, 2026
Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
cs.AI
Abstract
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
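The core idea in the abstract — treating the entire response as a single action in a contextual bandit, with a decoupled scalar critic providing a low-variance baseline — can be sketched as a sequence-level clipped PPO update. The function below is a minimal illustration under assumed shapes and names (`sppo_loss`, a per-prompt scalar `value_pred`, summed sequence log-probabilities); it is not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def sppo_loss(logp_new, logp_old, seq_reward, value_pred, clip_eps=0.2):
    """Sketch of a sequence-level PPO update (assumed form).

    logp_new / logp_old: summed log-probabilities of each full response
        under the current / behavior policy, shape (batch,).
    seq_reward: verifiable outcome reward per response, shape (batch,).
    value_pred: scalar baseline V(prompt) from a decoupled critic, shape (batch,).
    """
    # Contextual-bandit view: one advantage per sequence replaces
    # per-token temporal credit assignment (no GAE, no multi-sampling).
    advantage = (seq_reward - value_pred).detach()

    # Standard PPO clipped surrogate, applied once per sequence.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Critic regresses toward the observed outcome reward.
    value_loss = F.mse_loss(value_pred, seq_reward)
    return policy_loss, value_loss
```

Because the advantage is a single scalar per response, the critic only needs to score the prompt once, which is where the memory and throughput savings over token-level PPO and group-based baselines would come from.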