Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
May 29, 2025
Authors: Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
cs.AI
Abstract
Enhancing the reasoning capabilities of large language models effectively
using reinforcement learning (RL) remains a crucial challenge. Existing
approaches primarily adopt two contrasting advantage estimation granularities:
Token-level methods (e.g., PPO) aim to provide fine-grained advantage
signals but suffer from inaccurate estimation due to the difficulty of training
an accurate critic model. At the other extreme, trajectory-level methods (e.g.,
GRPO) solely rely on a coarse-grained advantage signal from the final reward,
leading to imprecise credit assignment. To address these limitations, we
propose Segment Policy Optimization (SPO), a novel RL framework that leverages
segment-level advantage estimation at an intermediate granularity, achieving a
better balance by offering more precise credit assignment than trajectory-level
methods and requiring fewer estimation points than token-level methods,
thereby enabling accurate advantage estimation via Monte Carlo (MC) sampling without a
critic model. SPO features three components with novel strategies: (1) flexible
segment partition; (2) accurate segment advantage estimation; and (3) policy
optimization using segment advantages, including a novel probability-mask
strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain
for short chain-of-thought (CoT), featuring novel cutpoint-based partition and
chain-based advantage estimation, achieving 6-12 percentage point
improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT,
featuring novel tree-based advantage estimation, which significantly reduces
the cost of MC estimation, achieving 7-11 percentage point improvements
over GRPO on MATH500 when evaluated with 2K and 4K context lengths. We make our code
publicly available at https://github.com/AIFrameResearch/SPO.
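
For readers who want a concrete picture of what segment-level, critic-free advantage estimation looks like, below is a minimal Python sketch in the spirit of the abstract. It is an illustration under assumptions, not the authors' implementation: the function names (partition_into_segments, mc_value, segment_advantages), the lowest-probability cutpoint heuristic, the rollout count, and the toy reward are all hypothetical; see the repository above for the actual method.

import random
from typing import Callable, List, Sequence

def partition_into_segments(token_logprobs: Sequence[float],
                            num_segments: int) -> List[int]:
    # Place segment boundaries after the lowest-probability tokens, a rough
    # analogue of the cutpoint-based partition described for SPO-chain.
    candidates = range(len(token_logprobs) - 1)
    cut_idxs = sorted(sorted(candidates,
                             key=lambda i: token_logprobs[i])[:num_segments - 1])
    return [i + 1 for i in cut_idxs] + [len(token_logprobs)]

def mc_value(prefix_len: int,
             rollout_reward: Callable[[int], float],
             num_rollouts: int = 4) -> float:
    # Monte Carlo value of a partial response: average final reward of
    # completions sampled from the prefix -- no learned critic is needed.
    return sum(rollout_reward(prefix_len) for _ in range(num_rollouts)) / num_rollouts

def segment_advantages(token_logprobs: Sequence[float],
                       rollout_reward: Callable[[int], float],
                       num_segments: int = 3) -> List[float]:
    # Advantage of each segment = change in MC value across its boundaries.
    boundaries = [0] + partition_into_segments(token_logprobs, num_segments)
    values = [mc_value(b, rollout_reward) for b in boundaries]
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]

if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins: per-token log-probs of a sampled response, and a reward
    # oracle whose success rate grows with the length of the kept prefix.
    fake_logprobs = [random.uniform(-3.0, -0.1) for _ in range(12)]
    toy_reward = lambda prefix_len: float(random.random() < 0.3 + 0.05 * prefix_len)
    print(segment_advantages(fake_logprobs, toy_reward))

In a full training loop, each segment's advantage would then be assigned to the tokens inside that segment (optionally gated by a probability mask, as the abstract describes) and used in a policy-gradient update.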