Single-stream Policy Optimization

Abstract

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from two critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25. SPO also achieves consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
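To make the contrast between group-based and single-stream baselines concrete, the following is a minimal sketch and not the implementation evaluated here. The abstract does not specify the tracker's update rule, so the exponential moving average, the `exp(-kl)` forgetting factor, and all names (`ValueTracker`, `group_advantages`, `single_stream_advantages`, `base_decay`) are illustrative assumptions. The sketch only shows why a degenerate group yields zero advantages while a persistent per-prompt baseline with batch-wide normalization still produces a signal.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-based baseline: center and scale rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ValueTracker:
    """Hypothetical persistent per-prompt baseline (a running reward estimate)."""

    def __init__(self, init=0.5, base_decay=0.9):
        self.values = {}              # prompt_id -> current estimate
        self.init = init              # assumed prior for unseen prompts
        self.base_decay = base_decay  # assumed memory when the policy has not moved

    def get(self, prompt_id):
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward, kl=0.0):
        # Assumed form of KL adaptation: the further the policy has drifted
        # since the last visit, the faster old observations are forgotten.
        decay = self.base_decay * math.exp(-kl)
        old = self.get(prompt_id)
        self.values[prompt_id] = decay * old + (1.0 - decay) * reward


def single_stream_advantages(batch, tracker, eps=1e-6):
    """batch: list of (prompt_id, reward) pairs, one sample per prompt."""
    raw = [reward - tracker.get(pid) for pid, reward in batch]
    mu, sigma = mean(raw), pstdev(raw)  # statistics over the whole batch
    return [(a - mu) / (sigma + eps) for a in raw]


# A degenerate group (all samples correct) carries no learning signal.
degenerate = group_advantages([1.0, 1.0, 1.0, 1.0])

# One sample per prompt, each compared against its own tracked baseline.
tracker = ValueTracker()
tracker.values.update({"easy": 0.9, "hard": 0.2})
batch = [("easy", 0.0), ("hard", 1.0), ("new", 1.0)]
advantages = single_stream_advantages(batch, tracker)
for pid, reward in batch:
    tracker.update(pid, reward, kl=0.05)
```

In this sketch every sample in `batch` receives a nonzero advantage, and no prompt has to wait for a full group of generations before its advantage can be computed, which is the property the abstract ties to throughput when generation times vary.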