Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Final-answer evaluation shows whether an agent succeeded, but not which parts of the trajectory made the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert the raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks the agent's claims, checks whether trajectory evidence supports them, and marks the spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
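
The abstract names two evaluation quantities, span-level error localization and first-error accuracy, without defining them. The sketch below shows one plausible way to score a single trajectory. It assumes that spans are identified by integer positions, that localization is scored as set-based precision, recall, and F1, and that a first-error hit means the earliest predicted span equals the earliest annotated span. The function names `span_prf` and `first_error_hit` are hypothetical, and the paper's exact definitions may differ.

```python
from typing import Set, Tuple


def span_prf(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over error-span indices.

    Assumption: each span is identified by its integer position in the
    trajectory, and a prediction counts only on an exact index match.
    """
    if not predicted and not gold:
        # Nothing to find and nothing flagged: treat as a perfect score.
        return 1.0, 1.0, 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def first_error_hit(predicted: Set[int], gold: Set[int]) -> bool:
    """True when the earliest predicted span is the earliest gold span.

    Assumption: a trajectory with no gold errors is a hit only if the
    auditor also flags nothing.
    """
    if not predicted or not gold:
        return not predicted and not gold
    return min(predicted) == min(gold)


if __name__ == "__main__":
    # Hypothetical trajectory: annotators marked spans 2 and 7 as harmful,
    # and the auditor flagged spans 2 and 5.
    print(span_prf({2, 5}, {2, 7}))
    print(first_error_hit({2, 5}, {2, 7}))
```

Averaging `first_error_hit` over all instances would give a first-error accuracy under these assumptions.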