Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Abstract

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether a failure stems from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, together with a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle to use knowledge even when they have sufficient evidence, and they almost never refuse appropriately when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

