FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
March 20, 2026
Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
cs.AI
Abstract
We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on an outcome reward model (ORM) that distributes a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating a discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that dense advantage formulations are a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
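The abstract describes FIPO's mechanism only in prose; below is a minimal sketch of how a discounted future-KL signal could turn a single trajectory-level ORM advantage into dense per-token advantages. This is an illustrative assumption written in PyTorch (to match the verl ecosystem), not the authors' released implementation: the function name `fipo_dense_advantages`, the discount `gamma`, and the mean-1 normalization are all hypothetical.

```python
import torch

def fipo_dense_advantages(
    per_token_kl: torch.Tensor,  # shape [T]: KL(pi_theta || pi_ref) at each token
    global_advantage: float,     # trajectory-level advantage from the outcome reward
    gamma: float = 0.99,         # discount on future influence (assumed value)
) -> torch.Tensor:
    """Sketch: re-weight a uniform ORM advantage by each token's discounted
    future-KL influence, so tokens that shift subsequent behavior get more credit."""
    T = per_token_kl.shape[0]
    # future_kl[t] = sum_{k > t} gamma**(k - t - 1) * per_token_kl[k],
    # computed with the standard reversed discounted-return recursion.
    future_kl = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running
        running = per_token_kl[t].item() + gamma * running
    # Normalize the influence weights to mean 1 so the average advantage
    # magnitude over the trajectory matches the original ORM advantage.
    weights = future_kl / (future_kl.mean() + 1e-8)
    return global_advantage * weights

# Toy usage: a uniform ORM advantage of 1.0 over eight tokens becomes dense.
dense_adv = fipo_dense_advantages(torch.rand(8), global_advantage=1.0)
```

The mean-1 normalization is one plausible way to keep the re-weighted advantages on the same scale as GRPO's uniform assignment; the paper's exact influence measure and normalization may differ.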