Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
October 1, 2025
Authors: Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Soujanya Poria, Jingren Zhou
cs.AI
Abstract
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly
evaluated on multi-hop deep search tasks, yet current practice suffers from two
major limitations. First, most benchmarks leak the reasoning path in the
question text, allowing models to follow surface cues rather than discover
reasoning chains autonomously. Second, evaluation is typically reduced to a
single pass rate, which collapses diverse behaviours into one score and
obscures whether failures stem from inadequate search, poor knowledge use, or
inappropriate refusal. To address these issues, we present WebDetective, a
benchmark of hint-free multi-hop questions paired with a controlled Wikipedia
sandbox that ensures full traceability of model actions, and a holistic
evaluation framework that separates search sufficiency, knowledge utilisation,
and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals
systematic weaknesses across all architectures: models struggle with knowledge
utilisation despite having sufficient evidence and demonstrate near-absent
appropriate refusal when evidence is lacking. These patterns expose a
fundamental gap: today's systems excel at executing given reasoning paths but
fail when required to discover them. We develop an agentic workflow,
EvidenceLoop, that explicitly targets the challenges our benchmark identifies,
incorporating verification loops and systematic evidence tracking that improve
both search and synthesis capabilities. This baseline demonstrates that
WebDetective's diagnostic framework can guide concrete architectural
improvements, establishing our benchmark as a critical tool for developing
genuinely autonomous reasoning systems rather than pattern-following agents.
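
The factorised evaluation sketched in the abstract separates a single pass rate into three conditional quantities. A minimal illustration of how such a decomposition could be scored is below; the record fields (retrieved_sufficient, answer_correct, refused) are hypothetical stand-ins, not the paper's actual schema:

    from dataclasses import dataclass

    @dataclass
    class Episode:
        retrieved_sufficient: bool  # did search surface all required evidence?
        answer_correct: bool        # did the final answer match the gold answer?
        refused: bool               # did the model decline to answer?

    def factorised_metrics(episodes: list[Episode]) -> dict[str, float]:
        """Decompose one pass rate into search, utilisation, and refusal scores."""
        n = len(episodes)
        # Search sufficiency: how often retrieval gathered enough evidence.
        search = sum(e.retrieved_sufficient for e in episodes) / n
        # Knowledge utilisation: correct answers *given* sufficient evidence.
        have = [e for e in episodes if e.retrieved_sufficient]
        utilisation = (sum(e.answer_correct for e in have) / len(have)
                       if have else 0.0)
        # Appropriate refusal: refusals *given* insufficient evidence.
        lack = [e for e in episodes if not e.retrieved_sufficient]
        refusal = (sum(e.refused for e in lack) / len(lack)
                   if lack else 0.0)
        return {"search_sufficiency": search,
                "knowledge_utilisation": utilisation,
                "appropriate_refusal": refusal}

Conditioning the utilisation and refusal scores on whether evidence was sufficient is what lets the framework distinguish a search failure from a synthesis failure or an unwarranted answer, which a flat pass rate conflates.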
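The abstract describes EvidenceLoop only at a high level: verification loops plus systematic evidence tracking. A schematic of that control flow under those assumptions follows; the search, synthesise, and verify callables are illustrative placeholders, not the paper's components:

    from typing import Callable, Optional

    def evidence_loop(
        question: str,
        search: Callable[[str, list[str]], list[str]],
        synthesise: Callable[[str, list[str]], str],
        verify: Callable[[str, list[str]], bool],
        max_iters: int = 5,
    ) -> Optional[str]:
        """Schematic loop: gather evidence, draft an answer, verify before returning."""
        evidence: list[str] = []  # systematic evidence tracking across hops
        for _ in range(max_iters):
            evidence.extend(search(question, evidence))  # fetch documents for the next hop
            draft = synthesise(question, evidence)       # compose an answer from the evidence
            if verify(draft, evidence):                  # verification loop: accept only grounded claims
                return draft
        return None  # no verified answer: refuse rather than guess

Returning None when verification never succeeds maps directly onto the benchmark's appropriate-refusal axis: the agent declines instead of answering beyond its evidence.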