

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

March 20, 2026
作者: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
cs.AI

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on an outcome-based reward model (ORM) that distributes a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that dense advantage formulations are a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
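
The abstract does not spell out the advantage formula, so the sketch below shows one plausible reading of it: each token's share of the trajectory-level GRPO advantage is re-weighted by the discounted KL divergence accumulated over the tokens that follow it, so tokens whose continuations diverge strongly from the reference policy receive more credit. The function name fipo_dense_advantages, the backward suffix-sum recursion, and the mean normalization are illustrative assumptions, not the paper's published method.

import torch

def fipo_dense_advantages(outcome_adv, per_token_kl, gamma=0.99, eps=1e-8):
    """Hypothetical sketch of a FIPO-style dense advantage.

    outcome_adv:  scalar GRPO-style advantage shared by the whole trajectory
    per_token_kl: (T,) tensor, KL(pi_theta || pi_ref) at each generated token
    gamma:        discount applied to KL contributions of future tokens

    Returns a (T,) tensor of per-token advantages in which each token's
    share of the global advantage is scaled by the discounted KL of the
    tokens that come after it (the token's "future influence").
    """
    T = per_token_kl.shape[0]
    future_kl = torch.zeros(T)
    # Backward suffix sum: future_kl[t] = sum_{t' > t} gamma^(t'-t) * kl[t']
    running = 0.0
    for t in range(T - 1, -1, -1):
        future_kl[t] = gamma * running      # discounted KL of tokens after t
        running = per_token_kl[t].item() + gamma * running
    # Normalize so the weights average to 1, preserving the overall scale
    # of the original outcome-based advantage (an assumed design choice).
    weights = future_kl / (future_kl.mean() + eps)
    return outcome_adv * weights

if __name__ == "__main__":
    # Toy trajectory: tokens with larger KL spikes act like "logical pivots".
    kl = torch.tensor([0.01, 0.20, 0.02, 0.15, 0.03])
    print(fipo_dense_advantages(outcome_adv=1.0, per_token_kl=kl))

Under this reading, early tokens that precede high-KL stretches inherit more of the advantage, while tokens near the end of the trajectory contribute little; the exact discounting and normalization in the paper may differ.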