Exploring Reasoning Reward Model for Agents
January 29, 2026
Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
cs.AI
Abstract
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance gains, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
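To make the abstract's description concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how the structured feedback could be represented and folded back into training; the class and function names, and the mixing weight `alpha`, are assumptions for illustration only.

```python
# Hypothetical sketch of Agent-RRM-style structured feedback; names and the
# mixing weight are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class RRMFeedback:
    """Structured feedback produced by a reasoning reward model."""
    reasoning_trace: str   # explicit reasoning over the agent trajectory
    critique: str          # focused critique highlighting reasoning flaws
    score: float           # overall process score, assumed in [0, 1]


def unified_reward(outcome_reward: float, feedback: RRMFeedback,
                   alpha: float = 0.5) -> float:
    """Blend the sparse outcome reward with the dense process score, as a
    unified-feedback strategy (in the spirit of Reagent-U) might do."""
    return (1 - alpha) * outcome_reward + alpha * feedback.score


def augmented_context(trajectory: str, feedback: RRMFeedback) -> str:
    """Append the critique to the trajectory so the policy can refine its next
    attempt, as a text-augmented strategy (in the spirit of Reagent-C) might do."""
    return f"{trajectory}\n\n[Critique]\n{feedback.critique}"


if __name__ == "__main__":
    fb = RRMFeedback(
        reasoning_trace="Step 2 queried the wrong entity before searching.",
        critique="Verify the entity name before issuing tool calls.",
        score=0.6,
    )
    print(unified_reward(outcome_reward=0.0, feedback=fb))
    print(augmented_context("<trajectory omitted>", fb))
```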