Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
January 9, 2026
Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
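To make the reward design described above more concrete, here is a minimal sketch of how a citation-aware rubric reward might be blended with a binary outcome reward. All names (`Rubric`, `rubric_reward`, `combined_reward`), the string-matching rubric check, and the weighting scheme are illustrative assumptions, not the authors' implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch: combining a citation-aware rubric reward with a
# binary outcome reward. Not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Rubric:
    hidden_entity: str       # entity the agent must explicitly identify
    required_citation: str   # document id that must be cited as support

def rubric_reward(trajectory_text: str, cited_docs: set[str],
                  rubrics: list[Rubric]) -> float:
    """Fraction of single-hop rubrics satisfied: the hidden entity is named
    in the agent's reasoning AND backed by the correct citation."""
    if not rubrics:
        return 0.0
    satisfied = sum(
        1 for r in rubrics
        if r.hidden_entity.lower() in trajectory_text.lower()
        and r.required_citation in cited_docs
    )
    return satisfied / len(rubrics)

def combined_reward(trajectory_text: str, cited_docs: set[str],
                    rubrics: list[Rubric], answer_correct: bool,
                    alpha: float = 0.5) -> float:
    """Blend the binary outcome reward with the rubric (process) reward."""
    outcome = 1.0 if answer_correct else 0.0
    return (1 - alpha) * outcome + alpha * rubric_reward(
        trajectory_text, cited_docs, rubrics)
```

In a GRPO-style setup, such a combined scalar would be computed per sampled trajectory and then normalized within each group to form the advantage; the exact weighting and rubric-verification procedure used by C-GRPO are detailed in the paper.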