Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
May 29, 2025
Authors: Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
cs.AI
Abstract
Enhancing the reasoning capabilities of large language models effectively
using reinforcement learning (RL) remains a crucial challenge. Existing
approaches primarily adopt two contrasting advantage estimation granularities:
Token-level methods (e.g., PPO) aim to provide fine-grained advantage
signals but suffer from inaccurate estimation due to difficulties in training
an accurate critic model. At the other extreme, trajectory-level methods (e.g.,
GRPO) solely rely on a coarse-grained advantage signal from the final reward,
leading to imprecise credit assignment. To address these limitations, we
propose Segment Policy Optimization (SPO), a novel RL framework that leverages
segment-level advantage estimation at an intermediate granularity, achieving a
better balance by offering more precise credit assignment than trajectory-level
methods and requiring fewer estimation points than token-level methods,
enabling accurate advantage estimation based on Monte Carlo (MC) without a
critic model. SPO features three components with novel strategies: (1) flexible
segment partition; (2) accurate segment advantage estimation; and (3) policy
optimization using segment advantages, including a novel probability-mask
strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain
for short chain-of-thought (CoT), featuring novel cutpoint-based partition and
chain-based advantage estimation, achieving 6-12 percentage point
improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT,
featuring novel tree-based advantage estimation, which significantly reduces
the cost of MC estimation, achieving 7-11 percentage point improvements
over GRPO on MATH500 under 2K and 4K context evaluation. We make our code
publicly available at https://github.com/AIFrameResearch/SPO.
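
To make the segment-level idea concrete, below is a minimal Python sketch (not the authors' implementation) of critic-free Monte Carlo advantage estimation over segments: a sampled response is split at chosen boundaries, the value of each prefix is estimated as the mean reward of a few MC completions, and each segment's advantage is the difference between consecutive boundary values. The names mc_value and segment_advantages, the rollout callable, and the toy reward are illustrative assumptions; the sketch omits the cutpoint-based partition, tree-based estimation, and probability-mask strategy introduced in the paper.

import random
from typing import Callable, List, Sequence

def mc_value(prefix: Sequence[int],
             rollout: Callable[[Sequence[int]], float],
             num_rollouts: int = 8) -> float:
    """Estimate V(prefix) as the mean terminal reward of MC completions from the prefix."""
    return sum(rollout(prefix) for _ in range(num_rollouts)) / num_rollouts

def segment_advantages(tokens: List[int],
                       boundaries: List[int],
                       rollout: Callable[[Sequence[int]], float],
                       num_rollouts: int = 8) -> List[float]:
    """One advantage per segment: A_k = V(prefix up to boundary k) - V(prefix up to boundary k-1).
    boundaries[k] is the token index just past segment k, e.g. [5, 12, len(tokens)]."""
    values = [mc_value(tokens[:b], rollout, num_rollouts) for b in [0] + boundaries]
    return [values[k + 1] - values[k] for k in range(len(boundaries))]

if __name__ == "__main__":
    random.seed(0)

    # Toy stand-in for "continue generating from this prefix and score the final
    # answer": the success probability grows with the length of the kept prefix.
    def toy_rollout(prefix: Sequence[int]) -> float:
        return float(random.random() < min(0.9, 0.1 + 0.04 * len(prefix)))

    tokens = list(range(20))      # pretend token ids of one sampled response
    boundaries = [5, 12, 20]      # three segments
    print(segment_advantages(tokens, boundaries, toy_rollout, num_rollouts=64))

Only a handful of boundary values per response are estimated here, which is what makes critic-free MC estimation affordable compared with per-token estimation.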
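
The abstract names a probability-mask strategy without detail. Purely as a generic illustration of gating which tokens receive a segment's advantage (an assumption for illustration, not the paper's definition), a masked REINFORCE-style surrogate could look like the following; the threshold and masking rule are hypothetical.

import math
from typing import List

def masked_pg_loss(logprobs: List[float],
                   seg_of_token: List[int],
                   seg_adv: List[float],
                   prob_threshold: float = 0.9) -> float:
    """Advantage-weighted log-likelihood, restricted to tokens whose sampling
    probability is below prob_threshold (the hypothetical probability mask)."""
    loss, kept = 0.0, 0
    for lp, seg in zip(logprobs, seg_of_token):
        if math.exp(lp) < prob_threshold:       # mask out near-certain tokens
            loss -= seg_adv[seg] * lp           # negative of the surrogate objective
            kept += 1
    return loss / max(kept, 1)

# Six tokens in two segments; only the three uncertain tokens contribute.
print(masked_pg_loss(
    logprobs=[-0.01, -1.2, -0.05, -2.3, -0.02, -0.9],
    seg_of_token=[0, 0, 0, 1, 1, 1],
    seg_adv=[0.25, -0.5],
))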