Single-stream Policy Optimization

Abstract

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods such as GRPO reduce variance with on-the-fly baselines but suffer from two critical flaws: frequent degenerate groups erase the learning signal, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates both issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Because it is group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. The persistent value tracker also naturally enables an adaptive curriculum via prioritized sampling. Experiments with Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating the computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25, and it achieves a consistent relative gain in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
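
To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes binary rewards and a hypothetical `ValueTracker` that keeps a per-prompt running average with a fixed step size; the KL-adaptive update described above is not specified in the abstract and is not reproduced here. The function names, the initial value of 0.5, and the step size of 0.2 are all illustrative assumptions.

```python
import statistics


def group_advantages(rewards):
    """GRPO-style advantages for one group: (r - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Degenerate group: every sample got the same reward, so no signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


class ValueTracker:
    """Hypothetical persistent per-prompt baseline (fixed-step running average).

    Assumption: a constant step size stands in for the KL-adaptive update.
    """

    def __init__(self, init=0.5, step=0.2):
        self.init = init
        self.step = step
        self.values = {}

    def get(self, prompt_id):
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward):
        value = self.get(prompt_id)
        self.values[prompt_id] = value + self.step * (reward - value)


def single_stream_advantages(tracker, batch):
    """One sample per prompt; batch is a list of (prompt_id, reward) pairs.

    The baseline comes from the tracker, and normalization uses statistics
    of the whole batch rather than of a per-prompt group.
    """
    raw = [reward - tracker.get(pid) for pid, reward in batch]
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw)
    scale = std if std > 0.0 else 1.0
    # Update the baselines only after the advantages have been computed.
    for pid, reward in batch:
        tracker.update(pid, reward)
    return [(a - mean) / scale for a in raw]
```

In this sketch, a group whose samples all succeed (or all fail) yields zero advantages under the group-based rule, whereas the tracker-based rule can still produce a non-zero signal for a single sample whenever the reward differs from the stored baseline and the batch contains prompts of differing difficulty.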