Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Abstract

Effectively enhancing the reasoning capabilities of large language models with reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities. Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation because an accurate critic model is difficult to train. At the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity. SPO achieves a better balance: it offers more precise credit assignment than trajectory-level methods and requires fewer estimation points than token-level methods, which enables accurate Monte Carlo (MC) advantage estimation without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, which achieves 6-12 percentage point improvements in accuracy over PPO and GRPO on GSM8K; and (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation that significantly reduces the cost of MC estimation, which achieves 7-11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
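
To make the idea of critic-free, segment-level MC advantage estimation concrete, here is a minimal Python sketch. The abstract does not spell out the estimator, so the difference-of-values form used here (value at the end of a segment minus value at its start, with each value estimated by averaging the rewards of sampled completions) is an assumption for illustration. The names `mc_value`, `segment_advantages`, `broadcast_to_tokens`, and `toy_rollout` are hypothetical and are not taken from the released SPO code.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical rollout signature: complete a prefix and return the final reward.
RolloutFn = Callable[[Sequence[str], random.Random], float]


def mc_value(prefix: Sequence[str], rollout_fn: RolloutFn,
             num_rollouts: int, rng: random.Random) -> float:
    """Monte Carlo value estimate: mean final reward over sampled completions."""
    return sum(rollout_fn(prefix, rng) for _ in range(num_rollouts)) / num_rollouts


def segment_advantages(tokens: Sequence[str], boundaries: Sequence[int],
                       rollout_fn: RolloutFn, final_reward: float,
                       num_rollouts: int = 8, seed: int = 0) -> List[float]:
    """One advantage per segment (assumed form: V(segment end) - V(segment start)).

    `boundaries` holds the exclusive end index of each segment; the last
    entry must equal len(tokens).
    """
    if not boundaries or boundaries[-1] != len(tokens):
        raise ValueError("last boundary must equal len(tokens)")
    rng = random.Random(seed)
    starts = [0] + list(boundaries[:-1])
    # Estimate the value at every segment start with MC rollouts (no critic).
    values = [mc_value(tokens[:s], rollout_fn, num_rollouts, rng) for s in starts]
    # The finished trajectory's value is simply its observed reward.
    values.append(final_reward)
    return [values[k + 1] - values[k] for k in range(len(starts))]


def broadcast_to_tokens(advantages: Sequence[float],
                        boundaries: Sequence[int]) -> List[float]:
    """Give every token the advantage of the segment that contains it."""
    per_token: List[float] = []
    start = 0
    for adv, end in zip(advantages, boundaries):
        per_token.extend([adv] * (end - start))
        start = end
    return per_token


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> float:
    """Toy stand-in for sampling a completion: reward 1 once 'good' is in the prefix."""
    return 1.0 if "good" in prefix else 0.0
```

With the toy rollout, a trajectory such as `["a", "b", "good", "c", "d", "e"]` split at boundaries `[2, 4, 6]` assigns all of the credit to the middle segment, the one that introduces the decisive token. A trajectory-level signal would instead spread the same final reward over every token.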

