LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
May 27, 2026
Authors: HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu
cs.AI
Abstract
Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge (information encoded in the model before retrieval) rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond the coverage of their intrinsic knowledge. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six regularly updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25 to 40 percentage points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
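The 90-day freshness criterion described above can be illustrated with a minimal sketch. This is not the authors' construction pipeline: the function name `within_freshness_window`, the inclusive window boundaries, and the example construction date are all assumptions made for illustration only.

```python
from datetime import date, timedelta


def within_freshness_window(
    published: date,
    construction_date: date,
    window_days: int = 90,
) -> bool:
    """Return True if `published` falls in the `window_days` days
    preceding `construction_date`.

    Assumption: both ends of the window are treated as inclusive;
    the abstract does not specify boundary handling.
    """
    earliest = construction_date - timedelta(days=window_days)
    return earliest <= published <= construction_date


# Hypothetical construction date, chosen only for this example.
construction = date(2026, 5, 1)

candidates = [
    date(2026, 4, 1),   # inside the window
    date(2026, 1, 30),  # 91 days before construction, too old
    date(2026, 5, 2),   # published after construction
]

# Keep only publication dates that satisfy the freshness criterion.
fresh = [d for d in candidates if within_freshness_window(d, construction)]
```

A filter of this kind would be only one step; the abstract also mentions excluding globally salient events, which is not modeled here.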