SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
April 10, 2026
Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
cs.AI
Abstract
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
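The core idea in the abstract — treating the entire response as a single action in a contextual bandit, with a decoupled scalar critic providing a low-variance baseline — can be sketched as a sequence-level clipped PPO update. The function below is a minimal illustration under assumed shapes and names (`sppo_loss`, a per-prompt scalar `value_pred`, summed sequence log-probabilities); it is not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def sppo_loss(logp_new, logp_old, seq_reward, value_pred, clip_eps=0.2):
    """Sketch of a sequence-level PPO update (assumed form).

    logp_new / logp_old: summed log-probabilities of each full response
        under the current / behavior policy, shape (batch,).
    seq_reward: verifiable outcome reward per response, shape (batch,).
    value_pred: scalar baseline V(prompt) from a decoupled critic, shape (batch,).
    """
    # Contextual-bandit view: one advantage per sequence replaces
    # per-token temporal credit assignment (no GAE, no multi-sampling).
    advantage = (seq_reward - value_pred).detach()

    # Standard PPO clipped surrogate, applied once per sequence.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Critic regresses toward the observed outcome reward.
    value_loss = F.mse_loss(value_pred, seq_reward)
    return policy_loss, value_loss
```

Because the advantage is a single scalar per response, the critic only needs to score the prompt once, which is where the memory and throughput savings over token-level PPO and group-based baselines would come from.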