LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Abstract

Are LLM-based search agents genuinely searching, or are they using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge (information encoded in the model before retrieval) rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
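
The recency and salience criteria described above can be illustrated with a minimal sketch. This is not the authors' construction pipeline: the `Candidate` record, its field names, the `select_candidates` helper, and the inclusive treatment of the window boundaries are all assumptions made for illustration. It only shows one way a filter of the form "fact published within the 90 days preceding construction, and not globally salient" might be written.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate question; field names are illustrative only."""
    question: str
    fact_published: date     # date the answer-bearing fact first appeared
    globally_salient: bool   # True if the underlying event was widely covered


def within_window(c: Candidate, built_on: date, window_days: int = 90) -> bool:
    """Return True if the fact appeared in the window_days before built_on.

    Both ends of the window are treated as inclusive (an assumption).
    """
    earliest = built_on - timedelta(days=window_days)
    return earliest <= c.fact_published <= built_on


def select_candidates(cands, built_on: date, window_days: int = 90):
    """Keep candidates that are recent enough and not globally salient."""
    return [
        c for c in cands
        if within_window(c, built_on, window_days) and not c.globally_salient
    ]
```

Under these assumptions, a question whose supporting fact is 91 days old on the construction date would be dropped, as would a recent question about a widely covered event, since an agent could plausibly answer either from intrinsic knowledge rather than from search.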