The Overlooked Free Lunch of Post-Training: Progress Advantage for LLM Agents

Abstract

Process reward models enable fine-grained, step-level evaluation of large language models (LLMs), yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need to train a dedicated reward model. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term the progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the progress advantage in three applications (test-time scaling, uncertainty quantification, and failure attribution) across five benchmarks and four model families. In all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses specially trained reward models. We complement these results with further analyses of the characteristics of the progress advantage, offering practical guidance for its adoption in real-world agentic systems.
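As an illustration of how such a signal could be read off in practice, the following is a minimal sketch, not the authors' implementation. It assumes that the step-level score is the sum over an action's tokens of the log-probability under the RL-trained policy minus the log-probability under the reference policy, scaled by a coefficient beta. The scaling coefficient, the token-level summation, and all function names are hypothetical choices made for this example; the abstract only states that the log-probability ratio recovers the optimal advantage function.

```python
from typing import List, Sequence, Tuple


def step_progress_advantage(
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Score one agent step as beta * sum_t (log pi(a_t) - log pi_ref(a_t)).

    Both inputs hold per-token log-probabilities of the SAME action tokens,
    one list from the RL-trained policy and one from its reference policy.
    The beta scaling and the sum over tokens are assumptions of this sketch.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both policies must score the same token sequence")
    return beta * sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def score_trajectory(
    steps: Sequence[Tuple[Sequence[float], Sequence[float]]],
    beta: float = 1.0,
) -> List[float]:
    """Return one score per step for a trajectory of (policy, reference) pairs."""
    return [step_progress_advantage(p, r, beta) for p, r in steps]


def lowest_scoring_step(scores: Sequence[float]) -> int:
    """Index of the step with the lowest score, a naive failure-attribution guess."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return min(range(len(scores)), key=lambda i: scores[i])


def best_candidate(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate, a naive test-time selection rule."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy trajectory with two steps; the log-probabilities are made-up numbers.
toy_steps = [
    ([-0.5, -0.25], [-1.0, -0.75]),  # policy prefers this action more than the reference
    ([-2.0], [-0.5]),                # policy prefers this action less than the reference
]
toy_scores = score_trajectory(toy_steps)
```

In this toy trajectory the first step receives a positive score and the second a negative one, so a rule such as `lowest_scoring_step` would flag the second step. In a real system the per-token log-probabilities would come from running both model checkpoints on the same context and action tokens.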