FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
March 20, 2026
Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
cs.AI
Abstract
We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on an outcome reward model (ORM) that distributes a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating a discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that dense advantage formulations are a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
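The abstract describes FIPO's mechanism only in prose; below is a minimal sketch of how a discounted future-KL signal could turn a single trajectory-level ORM advantage into dense per-token advantages. This is an illustrative assumption written in PyTorch (to match the verl ecosystem), not the authors' released implementation: the function name `fipo_dense_advantages`, the discount `gamma`, and the mean-1 normalization are all hypothetical.

```python
import torch

def fipo_dense_advantages(
    per_token_kl: torch.Tensor,  # shape [T]: KL(pi_theta || pi_ref) at each token
    global_advantage: float,     # trajectory-level advantage from the outcome reward
    gamma: float = 0.99,         # discount on future influence (assumed value)
) -> torch.Tensor:
    """Sketch: re-weight a uniform ORM advantage by each token's discounted
    future-KL influence, so tokens that shift subsequent behavior get more credit."""
    T = per_token_kl.shape[0]
    # future_kl[t] = sum_{k > t} gamma**(k - t - 1) * per_token_kl[k],
    # computed with the standard reversed discounted-return recursion.
    future_kl = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running
        running = per_token_kl[t].item() + gamma * running
    # Normalize the influence weights to mean 1 so the average advantage
    # magnitude over the trajectory matches the original ORM advantage.
    weights = future_kl / (future_kl.mean() + 1e-8)
    return global_advantage * weights

# Toy usage: a uniform ORM advantage of 1.0 over eight tokens becomes dense.
dense_adv = fipo_dense_advantages(torch.rand(8), global_advantage=1.0)
```

The mean-1 normalization is one plausible way to keep the re-weighted advantages on the same scale as GRPO's uniform assignment; the paper's exact influence measure and normalization may differ.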