Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

October 1, 2025
作者: Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Soujanya Poria, Jingren Zhou
cs.AI

Abstract

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
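The abstract's central methodological point is that a single pass rate conflates three distinct failure modes, which the factorised metrics separate: whether search surfaced sufficient evidence, whether that evidence was used correctly, and whether the model refused when evidence was missing. A minimal sketch of such a decomposition is below; the field names and metric definitions are illustrative assumptions, not WebDetective's actual schema or scoring code.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Hypothetical per-question log entry; fields are assumptions for illustration.
    evidence_sufficient: bool  # did search surface all evidence needed to answer?
    answer_correct: bool       # does the final answer match the gold answer?
    refused: bool              # did the model decline to answer?

def factorised_metrics(records):
    """Split a single pass rate into three conditional factors:
    search sufficiency, knowledge utilisation, and refusal behaviour."""
    n = len(records)
    sufficient = [r for r in records if r.evidence_sufficient]
    insufficient = [r for r in records if not r.evidence_sufficient]
    return {
        # How often search alone gathered enough evidence.
        "search_sufficiency": len(sufficient) / n,
        # Given sufficient evidence, how often it yielded a correct answer.
        "knowledge_utilisation": (
            sum(r.answer_correct for r in sufficient) / len(sufficient)
            if sufficient else 0.0
        ),
        # Given missing evidence, how often the model appropriately refused.
        "appropriate_refusal": (
            sum(r.refused for r in insufficient) / len(insufficient)
            if insufficient else 0.0
        ),
    }
```

Conditioning the latter two rates on evidence sufficiency is what makes the diagnosis actionable: a low overall pass rate with high search sufficiency points to synthesis failures, while low appropriate refusal flags overconfident answering.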