Where Do Deep-Research Agents Fail? Span-Level Error Localization in Agent Trajectories

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeded, but not which parts of the trajectory made the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert the raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks the agent's claims, checks their support in the trajectory evidence, and marks the spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
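
As an illustration only, the claim-centric idea can be sketched in a few lines of Python. This is not the DRIFT implementation: the `Span` type, the `kind` labels, the `on_answer_path` flag, and the word-overlap `supported` check are all hypothetical stand-ins chosen for this sketch, and a real auditor would presumably use an LLM or an entailment model for the support check.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Span:
    idx: int
    kind: str                     # hypothetical labels: "evidence" or "claim"
    text: str
    on_answer_path: bool = False  # assumed flag: does the final answer rely on this span?


def _words(text: str) -> set[str]:
    # Lowercased content words (longer than 3 characters), punctuation stripped.
    return {w.lower().strip(".,") for w in text.split() if len(w) > 3}


def supported(claim: str, evidence: list[str]) -> bool:
    # Toy support check (an assumption of this sketch): a claim counts as
    # supported if all of its content words occur in one evidence text.
    return any(_words(claim) <= _words(ev) for ev in evidence)


def audit(spans: list[Span]) -> list[int]:
    # Walk the trajectory in order, accumulating evidence seen so far, and
    # flag claims that are unsupported AND feed into the answer path.
    evidence: list[str] = []
    flagged: list[int] = []
    for span in spans:
        if span.kind == "evidence":
            evidence.append(span.text)
        elif span.kind == "claim":
            if span.on_answer_path and not supported(span.text, evidence):
                flagged.append(span.idx)
    return flagged


trajectory = [
    Span(0, "evidence", "The bridge opened in 1932."),
    Span(1, "claim", "The bridge opened in 1932.", on_answer_path=True),
    Span(2, "claim", "The bridge closed in 1950."),  # tentative, not used in the answer
    Span(3, "claim", "The architect was Smith.", on_answer_path=True),
]
flagged = audit(trajectory)
print(flagged)  # [3]
```

In this toy trajectory, span 2 is unsupported but never reaches the answer, so it is treated as a harmless tentative hypothesis. Only span 3 is flagged, and it is also the first error.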