
Single-stream Policy Optimization

September 16, 2025
Authors: Zhongwen Xu, Zihan Ding
cs.AI

Abstract

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
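To make the described mechanism concrete, below is a minimal sketch of a single-stream advantage computation under the assumptions stated in the abstract: one rollout per prompt (no groups), a persistent per-prompt value tracker as the baseline, and advantage normalization over the whole batch. The class and function names, the tracker update rule, and the specific KL-adaptive step size are illustrative assumptions, not the authors' implementation; the paper's exact formulation may differ.

```python
# Sketch of SPO-style advantages: persistent baseline + global normalization.
# All update rules below are assumptions for illustration only.
import numpy as np


class PersistentValueTracker:
    """Keeps a running value estimate per prompt across training steps."""

    def __init__(self, init_value: float = 0.5):
        self.values: dict[str, float] = {}
        self.init_value = init_value

    def get(self, prompt_id: str) -> float:
        return self.values.get(prompt_id, self.init_value)

    def update(self, prompt_id: str, reward: float, kl_to_old_policy: float) -> None:
        # Hypothetical KL-adaptive step size: update the baseline faster when the
        # current policy has drifted further from the policy behind the old estimate.
        lr = min(1.0, 0.1 + kl_to_old_policy)
        v = self.get(prompt_id)
        self.values[prompt_id] = v + lr * (reward - v)


def spo_advantages(prompt_ids, rewards, kls, tracker, eps=1e-8):
    """Advantage = reward - persistent baseline, then normalized across the batch."""
    raw = np.array([r - tracker.get(pid) for pid, r in zip(prompt_ids, rewards)])
    for pid, r, kl in zip(prompt_ids, rewards, kls):
        tracker.update(pid, r, kl)
    # Global normalization: every sample keeps a usable learning signal even when
    # all rollouts for a given prompt share the same reward (no degenerate groups).
    return (raw - raw.mean()) / (raw.std() + eps)


# Toy usage: a batch of four independent rollouts, one per prompt.
tracker = PersistentValueTracker()
adv = spo_advantages(
    prompt_ids=["p1", "p2", "p3", "p4"],
    rewards=[1.0, 0.0, 1.0, 0.0],
    kls=[0.02, 0.05, 0.01, 0.03],
    tracker=tracker,
)
print(adv)
```

Because the baseline persists across steps, the same tracker values could also drive the prioritized sampling mentioned in the abstract, e.g. by sampling prompts whose estimated value is near 0.5 more often; that selection rule is likewise an assumption here.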