FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
October 26, 2025
Authors: Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a
promising paradigm for enhancing the reasoning capabilities of large language
models (LLMs). In this context, models explore reasoning trajectories and
exploit rollouts with correct answers as positive signals for policy
optimization. However, these rollouts might involve flawed patterns such as
answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are
rewarded identically to fully correct ones, causing policy models to
internalize these unreliable reasoning patterns. In this work, we first conduct
a systematic study of flawed-positive rollouts in RL and find that they enable
rapid capability gains during the early optimization stage, while constraining
reasoning capability later by reinforcing unreliable patterns. Building on
these insights, we propose Flawed-Aware Policy Optimization (FAPO), which
introduces a parameter-free reward penalty for flawed-positive rollouts, enabling
the policy to leverage them as useful shortcuts in the warm-up stage, securing
stable early gains, while gradually shifting optimization toward reliable
reasoning in the later refinement stage. To accurately and comprehensively
detect flawed-positive rollouts, we introduce a generative reward model (GenRM)
with a process-level reward that precisely localizes reasoning errors.
Experiments show that FAPO is effective across broad domains, improving outcome
correctness, process reliability, and training stability without increasing the
token budget.
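
To make the reward-shaping idea concrete, below is a minimal, hypothetical sketch of how a flawed-positive penalty could slot into an RLVR reward function. The helper names (`verify_answer`, `detect_flaw`) and the fixed `flaw_penalty` value are illustrative assumptions, not the paper's implementation: FAPO's actual penalty is parameter-free, and flaw detection is performed by a generative reward model (GenRM) with process-level rewards that localizes the erroneous step.

```python
# Minimal sketch (not the authors' implementation) of reward shaping with a
# flawed-positive penalty in an RLVR loop. `verify_answer`, `detect_flaw`,
# and the fixed `flaw_penalty` are illustrative assumptions; FAPO's penalty
# is parameter-free rather than a hand-set constant.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    reasoning: str   # chain-of-thought produced by the policy
    answer: str      # final answer extracted from the rollout


def rlvr_rewards(
    rollouts: List[Rollout],
    reference_answers: List[str],
    verify_answer: Callable[[str, str], bool],   # outcome verifier (assumed helper)
    detect_flaw: Callable[[str], bool],          # GenRM-style flaw detector (assumed helper)
    flaw_penalty: float = 0.5,                   # illustrative penalty strength
) -> List[float]:
    """Assign scalar rewards: 1.0 for fully correct rollouts, 0.0 for wrong
    answers, and a reduced-but-positive reward for flawed-positive rollouts
    (correct answer reached through an unreliable reasoning path)."""
    rewards = []
    for rollout, ref in zip(rollouts, reference_answers):
        if not verify_answer(rollout.answer, ref):
            rewards.append(0.0)                   # wrong outcome: no positive signal
        elif detect_flaw(rollout.reasoning):
            rewards.append(1.0 - flaw_penalty)    # flawed-positive: still positive,
                                                  # but below a fully correct rollout
        else:
            rewards.append(1.0)                   # reliable reasoning and correct answer
    return rewards


# Toy usage with stub verifier/detector:
rollouts = [Rollout("2+2?", "2+2=4 by addition", "4"),
            Rollout("2+2?", "guessing... probably 4", "4")]
refs = ["4", "4"]
print(rlvr_rewards(rollouts, refs,
                   verify_answer=lambda a, r: a == r,
                   detect_flaw=lambda cot: "guess" in cot))
# -> [1.0, 0.5]
```

Keeping the flawed-positive reward above zero mirrors the abstract's two-stage intuition: early in training such rollouts still provide useful learning signal, while the gap below fully correct rollouts steers later optimization toward reliable reasoning.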