LiveBrowseComp: Are Search Agents Searching, or Verifying What They Already Know?

Abstract

Are LLM-based search agents genuinely searching, or are they using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge (information encoded in the model before retrieval) rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. To address this, we introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond the coverage of their intrinsic knowledge. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six continuously updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25 to 40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
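The selection rule described above (facts published within the 90 days before benchmark construction, with globally salient events excluded) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Question` fields, the `globally_salient` flag, and the function names are hypothetical and chosen only for illustration. The `accuracy` helper shows how a closed-book or search-augmented score could be computed as a percentage from per-question correctness flags.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Question:
    text: str
    fact_published: date    # hypothetical field: when the supporting fact appeared
    globally_salient: bool  # hypothetical flag: widely covered event


def within_window(q: Question, built_on: date, window_days: int = 90) -> bool:
    """True if the fact was published in the window_days before built_on (inclusive)."""
    earliest = built_on - timedelta(days=window_days)
    return earliest <= q.fact_published <= built_on


def select_questions(candidates: list[Question], built_on: date) -> list[Question]:
    """Keep recent questions and drop those about globally salient events."""
    return [
        q for q in candidates
        if within_window(q, built_on) and not q.globally_salient
    ]


def accuracy(correct_flags: list[bool]) -> float:
    """Percentage of correct answers; 0.0 for an empty list."""
    if not correct_flags:
        return 0.0
    return 100.0 * sum(correct_flags) / len(correct_flags)
```

Running `accuracy` once on closed-book outputs and once on search-augmented outputs for the same question set gives the two scores that the abstract compares.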