Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
January 9, 2026
Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning processes and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR with outcome rewards to train robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
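A minimal sketch of the reward combination the abstract describes, assuming a simple linear blend: a rubric score (the fraction of single-hop rubrics satisfied with correct citations) is mixed with a binary outcome reward, and the combined rewards are normalized within each sampled group, GRPO-style. The mixing weight `alpha`, the linear form, and the function names (`combined_reward`, `group_relative_advantages`) are illustrative assumptions, not the paper's exact formulation; see the repository for the actual implementation.

```python
import numpy as np

def combined_reward(outcome_correct: bool, rubrics_satisfied: int,
                    num_rubrics: int, alpha: float = 0.5) -> float:
    """Blend a binary outcome reward with a CaRR-style rubric score.

    The linear blend and the weight `alpha` are assumptions; the paper
    only states that CaRR and outcome rewards are combined.
    """
    outcome_r = 1.0 if outcome_correct else 0.0
    # Fraction of verifiable single-hop rubrics met with correct citations.
    rubric_r = rubrics_satisfied / max(num_rubrics, 1)
    return alpha * outcome_r + (1.0 - alpha) * rubric_r

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of 4 rollouts for one question with 3 rubrics.
rewards = [
    combined_reward(True, 3, 3),   # correct answer, full evidence chain
    combined_reward(True, 1, 3),   # correct answer via shortcut
    combined_reward(False, 2, 3),  # wrong answer, partial evidence
    combined_reward(False, 0, 3),  # wrong answer, no grounded evidence
]
print(group_relative_advantages(rewards))
```

Under such a blend, a shortcut rollout that guesses the right answer without a grounded evidence chain receives a smaller group-relative advantage than a fully cited one, which matches the behavior the paper attributes to C-GRPO.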