Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.