FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a single global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling because it cannot distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating the discounted future-KL divergence into the policy update, yielding a dense advantage formulation that re-weights each token according to its influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
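The re-weighting idea described above can be illustrated with a minimal sketch. It is not the paper's formulation: the per-token divergence proxy (`r - 1 - log r`), the function names, the discount `gamma`, the scale `beta`, and the mean-one normalization are all assumptions chosen for illustration. The sketch only shows how a single outcome-level advantage could be spread unevenly over tokens according to how much discounted policy shift follows each one.

```python
import math

def token_kl_proxy(logp_new, logp_old):
    """Per-token non-negative divergence proxy from sampled-token log-probs.

    Assumption: uses r - 1 - log(r) with r = pi_new / pi_old; the paper's
    exact per-token KL term may differ.
    """
    out = []
    for ln, lo in zip(logp_new, logp_old):
        d = ln - lo                      # log of the probability ratio r
        out.append(math.exp(d) - 1.0 - d)
    return out

def discounted_future_kl(kl, gamma):
    """future[t] = sum over k > t of gamma**(k - t - 1) * kl[k]."""
    future = [0.0] * len(kl)
    # Backward recursion: future[t] = kl[t+1] + gamma * future[t+1].
    for t in range(len(kl) - 2, -1, -1):
        future[t] = kl[t + 1] + gamma * future[t + 1]
    return future

def dense_advantages(advantage, future, beta=1.0):
    """Spread one trajectory-level advantage over tokens (non-empty input).

    Hypothetical weighting: w_t is proportional to 1 + beta * future[t],
    normalized to mean 1 so the total advantage mass is unchanged.
    """
    raw = [1.0 + beta * f for f in future]
    mean_raw = sum(raw) / len(raw)
    return [advantage * w / mean_raw for w in raw]

# Toy trajectory: only the last token's probability changed under the update.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-1.0, -1.0, -1.0, -0.5]
kl = token_kl_proxy(logp_new, logp_old)
future = discounted_future_kl(kl, gamma=0.5)
adv = dense_advantages(1.0, future)
```

In this toy case the token immediately before the shifted position receives the largest share of the advantage, and earlier tokens receive progressively less because of the discount; a uniform GRPO-style assignment would instead give every token the same value.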

