Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
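
To make the two reported quantities concrete, the following is a minimal sketch of how span-level localization and first-error accuracy could be scored for a single trajectory. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not define the metrics, so the set-overlap precision/recall/F1 formulation, the rule that a first-error hit means the earliest predicted span equals the earliest gold span, and the names `SpanScores` and `score_trajectory` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpanScores:
    """Per-trajectory scores (hypothetical container, not from the paper)."""
    precision: float
    recall: float
    f1: float
    first_error_correct: bool


def score_trajectory(gold: set[int], predicted: set[int]) -> SpanScores:
    """Compare predicted error-span indices against annotated ones.

    Assumption: spans are identified by their integer position in the
    trajectory, so a smaller index means an earlier span.
    """
    overlap = len(gold & predicted)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    # F1 is the harmonic mean; it is defined as 0 when nothing overlaps.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    # Assumed first-error rule: the earliest flagged span must match
    # the earliest annotated error span.
    first_ok = bool(gold) and bool(predicted) and min(gold) == min(predicted)
    return SpanScores(precision, recall, f1, first_ok)


if __name__ == "__main__":
    # Spans 3, 5, and 8 are annotated as harmful; the auditor flags 3, 4, 5.
    print(score_trajectory(gold={3, 5, 8}, predicted={3, 4, 5}))
```

Under these assumptions, flagging a harmless span that precedes the first real error counts against both precision and first-error accuracy, which is one way to penalize an auditor that mistakes failed searches or tentative hypotheses for errors.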