SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) on reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting for two reasons: temporal credit assignment is unstable over long Chain-of-Thought (CoT) horizons, and the value model carries a prohibitive memory cost. Critic-free alternatives such as GRPO mitigate these issues, but they require multiple samples per prompt for baseline estimation, which adds significant computational overhead and severely limits training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that combines the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a sequence-level contextual bandit problem and employs a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
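As a rough illustration of the contrast drawn in the abstract, the sketch below compares a group-relative baseline (GRPO-style, several samples per prompt) with a single-sample advantage computed against a scalar value estimate, and applies a PPO-style clipped surrogate once per response. The abstract does not give the exact objective, so the function names, the use of summed token log-probabilities for the ratio, and the clipping constant are all assumptions made for illustration, not the authors' implementation.

```python
import math


def group_advantages(rewards):
    """Group-relative advantages: needs several sampled responses per prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


def sequence_advantage(reward, value_estimate):
    """Single-sample advantage (assumed form): outcome reward minus a scalar
    baseline predicted for the prompt by a separate value function."""
    return reward - value_estimate


def clipped_sequence_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate evaluated once for the whole response.

    logp_new and logp_old are hypothetical summed token log-probabilities of
    the response under the current and the behaviour policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# One sampled response with a verifiable reward of 1.0 and a baseline of 0.25.
adv = sequence_advantage(1.0, 0.25)
objective = clipped_sequence_objective(-9.0, -10.0, adv)
```

In this sketch the whole response is treated as a single action, so one advantage value is shared by every token and no per-token value predictions are needed. The group-based variant has to generate several responses for the same prompt before any advantage can be computed.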
