Single-stream Policy Optimization

September 16, 2025
Authors: Zhongwen Xu, Zihan Ding
cs.AI

Abstract

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25, along with consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
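
To make the abstract's two core ideas concrete, here is a minimal, illustrative sketch: a persistent per-prompt value tracker serves as the baseline, and advantages are standardized globally over the whole batch rather than within per-prompt groups. The abstract does not specify the KL-adaptive update rule, so the exponential-moving-average form, the KL modulation of its step size, and all names (ValueTracker, spo_advantages, base_lr) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class ValueTracker:
    """Persistent per-prompt baseline.

    Hypothetical EMA form: the paper's KL-adaptive rule is only assumed here,
    with the step size growing when the policy has drifted more (larger KL).
    """

    def __init__(self, init_value=0.0, base_lr=0.1):
        self.value = init_value
        self.base_lr = base_lr

    def update(self, reward, kl_to_old_policy):
        # Assumption: larger policy drift -> track recent rewards faster.
        lr = min(1.0, self.base_lr * (1.0 + kl_to_old_policy))
        self.value += lr * (reward - self.value)


def spo_advantages(rewards, baselines, eps=1e-8):
    """Advantage = reward minus each prompt's persistent baseline,
    then standardized globally across the batch (no per-group statistics)."""
    adv = np.asarray(rewards, dtype=np.float64) - np.asarray(baselines, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)


# Toy usage: one rollout per prompt per step (single-stream), no grouping.
trackers = {pid: ValueTracker() for pid in range(4)}
rewards = [1.0, 0.0, 1.0, 0.0]   # e.g. binary correctness rewards (placeholder values)
kls = [0.02, 0.05, 0.01, 0.03]   # per-sample KL estimates (placeholder values)

baselines = [trackers[i].value for i in range(4)]
advantages = spo_advantages(rewards, baselines)
for i in range(4):
    trackers[i].update(rewards[i], kls[i])
print(advantages)
```

Because the baseline is carried across steps rather than recomputed from a group of rollouts for the same prompt, no per-prompt group is needed, which is what removes the degenerate-group failure mode and the per-group synchronization barrier described above.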