SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
April 10, 2026
Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
cs.AI
Abstract
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
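To make the sequence-level reformulation concrete, here is a minimal sketch of the kind of update the abstract describes: a single clipped PPO surrogate per sampled response, with the advantage computed as the outcome reward minus a scalar value baseline rather than via per-token credit assignment or GRPO-style group means. The function name and signature are hypothetical illustrations, not the paper's actual implementation.

```python
import math

def sequence_ppo_loss(logp_new, logp_old, reward, value, clip_eps=0.2):
    """Hypothetical sketch of a sequence-level PPO update for one response.

    logp_new / logp_old: summed token log-probs of the full response under
    the current and behavior policies (one scalar per sequence, treating
    the whole response as a single bandit action).
    reward: verifiable outcome reward for the sequence.
    value: scalar baseline V(prompt) from a decoupled value function.
    """
    # Sequence-level advantage: outcome reward minus a scalar baseline,
    # so no temporal credit assignment and no multi-sample baseline.
    advantage = reward - value
    # One importance ratio for the entire sequence.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Standard PPO clipped surrogate (negated to act as a loss).
    return -min(ratio * advantage, clipped * advantage)
```

With matching policies (ratio = 1) the loss reduces to the negated advantage; when the ratio drifts beyond 1 ± clip_eps, the clipped branch caps the update, exactly as in token-level PPO but applied once per sequence.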