Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Abstract

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, together with a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation even when they have sufficient evidence, and they almost never refuse appropriately when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
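To make the contrast between a single pass rate and factorised metrics concrete, the following is a minimal sketch of how per-question outcomes might be split into separate scores for search sufficiency, knowledge utilisation, and refusal behaviour. The `Outcome` fields, the function name, and the exact metric definitions are assumptions made for illustration; they are not taken from WebDetective and may differ from the benchmark's actual formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One question's result; all three fields are hypothetical labels."""
    evidence_sufficient: bool  # search surfaced everything needed to answer
    answered: bool             # False means the model refused
    correct: bool              # only meaningful when answered is True


def _rate(hits: int, total: int) -> float:
    """Return hits / total, or NaN when the denominator is empty."""
    return hits / total if total else float("nan")


def factorised_metrics(outcomes: list[Outcome]) -> dict[str, float]:
    """Split one aggregate score into three illustrative component scores."""
    sufficient = [o for o in outcomes if o.evidence_sufficient]
    insufficient = [o for o in outcomes if not o.evidence_sufficient]
    return {
        # The single number most benchmarks report.
        "pass_rate": _rate(
            sum(o.answered and o.correct for o in outcomes), len(outcomes)
        ),
        # How often search gathered enough evidence.
        "search_sufficiency": _rate(len(sufficient), len(outcomes)),
        # Given enough evidence, how often the answer was correct.
        "knowledge_utilisation": _rate(
            sum(o.answered and o.correct for o in sufficient), len(sufficient)
        ),
        # Given too little evidence, how often the model refused.
        "appropriate_refusal": _rate(
            sum(not o.answered for o in insufficient), len(insufficient)
        ),
    }


# Four made-up outcomes, purely for demonstration.
outcomes = [
    Outcome(evidence_sufficient=True, answered=True, correct=True),
    Outcome(evidence_sufficient=True, answered=True, correct=False),
    Outcome(evidence_sufficient=False, answered=True, correct=False),
    Outcome(evidence_sufficient=False, answered=False, correct=False),
]
metrics = factorised_metrics(outcomes)
```

On this toy input the pass rate is 0.25, while the three component scores are each 0.5. The single number cannot tell a search failure from a synthesis failure or a missing refusal, which is the ambiguity the factorised view is meant to remove.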
