
**FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning**

October 26, 2025
Authors: Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
cs.AI

Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
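The abstract does not spell out the penalty formulation, so the sketch below is only a minimal illustration of the core idea: giving flawed-positive rollouts a reduced (but still positive) reward during RLVR. All names (`Rollout`, `shaped_reward`, `FLAW_DISCOUNT`) are hypothetical, the flaw flag stands in for the output of the process-level GenRM described above, and the fixed discount here only approximates the paper's parameter-free penalty.

```python
# Minimal, hypothetical sketch of FAPO-style reward shaping for RLVR.
# Not the authors' implementation: the abstract only states that
# flawed-positive rollouts receive a reward penalty, so a fixed
# illustrative discount is used here.

from dataclasses import dataclass

FLAW_DISCOUNT = 0.5  # illustrative constant; the paper describes its penalty as parameter-free


@dataclass
class Rollout:
    answer_correct: bool    # outcome check from the verifiable-reward verifier
    has_process_flaw: bool  # e.g. answer-guessing or jump-in-reasoning,
                            # flagged by a process-level GenRM (assumed interface)


def shaped_reward(rollout: Rollout) -> float:
    """Return the scalar reward used for policy optimization."""
    if not rollout.answer_correct:
        return 0.0                      # incorrect answer: no positive signal
    if rollout.has_process_flaw:
        return 1.0 - FLAW_DISCOUNT      # flawed-positive: still positive, but penalized
    return 1.0                          # fully correct rollout: full reward


if __name__ == "__main__":
    for r in (
        Rollout(answer_correct=True, has_process_flaw=False),
        Rollout(answer_correct=True, has_process_flaw=True),
        Rollout(answer_correct=False, has_process_flaw=False),
    ):
        print(r, "->", shaped_reward(r))
```

Keeping flawed positives above zero preserves the early-stage shortcut signal the paper describes, while the gap to fully correct rollouts pushes later optimization toward reliable reasoning.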