

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

August 8, 2025
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
cs.AI

Abstract

Deep-research agents, which integrate large language models (LLMs) with search tools, have proven effective at handling complex queries that require iterative search planning and reasoning over search results. However, evaluations on current benchmarks such as BrowseComp rely on black-box live web search APIs, which have two notable limitations: (1) fairness: dynamic and opaque web APIs hinder fair comparison and reproducibility of deep-research methods; (2) transparency: the lack of control over the document corpus makes it difficult to isolate the retriever's contribution. In other words, current evaluations may compare complete deep-research systems at a given point in time, but they do not support well-controlled experiments that provide insight into the capabilities of the underlying deep-research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark proves effective at distinguishing the performance of deep-research systems. For instance, the open-source model Search-R1, paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Pairing GPT-5 with the Qwen3-Embedding-8B retriever further raises its accuracy to 70.1% while requiring fewer search calls. This benchmark enables comprehensive evaluation and disentangled analysis of deep-research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in deep-research systems.
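The fixed-corpus design is the key mechanism here: replacing a live web search API with a search tool over a frozen document collection makes agent runs reproducible and lets the retriever be swapped independently of the LLM. Below is a minimal, hypothetical sketch of that setup using BM25 over a toy corpus; the corpus contents, schema, and the `search` tool signature are illustrative assumptions, not the benchmark's actual interface.

```python
# Hypothetical sketch of a fixed-corpus search tool, in the spirit of
# BrowseComp-Plus. The toy corpus, its schema, and the tool signature
# are illustrative assumptions, not the benchmark's actual interface.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# A frozen, curated corpus replaces the live web: every run sees the
# same documents, so results are reproducible.
corpus = [
    {"docid": "doc-001", "text": "BM25 is a lexical ranking function based on term frequency."},
    {"docid": "doc-002", "text": "Dense retrievers embed queries and documents into a shared vector space."},
    {"docid": "doc-003", "text": "Hard negatives are near-miss documents mined to stress-test retrievers."},
]

tokenized = [doc["text"].lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def search(query: str, k: int = 2) -> list[dict]:
    """The search tool exposed to the agent in place of a web API.
    Swapping this function for a dense retriever (e.g. an embedding
    model) isolates the retriever's contribution to end accuracy."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [
        {"docid": corpus[i]["docid"], "text": corpus[i]["text"], "score": float(scores[i])}
        for i in ranked[:k]
    ]

# An agent loop would interleave LLM reasoning steps with tool calls like:
print(search("lexical ranking with term frequency"))
```

Because the corpus is frozen, differences in accuracy or in the number of search calls across runs can be attributed to the agent or the retriever rather than to drift in live web results.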