EvoBrowseComp: Benchmarking Search Agents under Evolving Knowledge

Abstract

Search agents, that is, large language models augmented with search tools, have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, which leaves them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall and reasoning shortcuts rather than genuine retrieval, which obscures their true browsing competence. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge by credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in the synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm that the benchmark is highly difficult and that solving it requires broad horizontal search. EvoBrowseComp thus establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.
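
To make the roles of the filtering and guidance agents more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the `Fact` fields, the numeric thresholds, the reading of "popularity" as a proxy for how likely a fact is to be memorized, and the reading of a "shortcut" as a path in the reasoning graph that reaches the answer in fewer hops than intended.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One retrieved piece of knowledge (all fields are hypothetical)."""
    text: str
    credibility: float  # 0..1, higher means a more trustworthy source
    popularity: float   # 0..1, higher means more likely to be memorized


def filter_facts(facts, min_credibility=0.7, max_popularity=0.3):
    """Keep facts that are credible yet obscure (thresholds are assumed)."""
    return [
        f for f in facts
        if f.credibility >= min_credibility and f.popularity <= max_popularity
    ]


def shortest_hops(graph, start, answer):
    """Breadth-first search: fewest edges from start to answer, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == answer:
            return hops
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def has_shortcut(graph, start, answer, intended_hops):
    """True when the answer is reachable in fewer hops than intended."""
    hops = shortest_hops(graph, start, answer)
    return hops is not None and hops < intended_hops


if __name__ == "__main__":
    # Hypothetical reasoning graph: the edge B -> ANS bypasses the A -> C chain.
    graph = {"Q": ["A", "B"], "A": ["C"], "C": ["ANS"], "B": ["ANS"]}
    print(has_shortcut(graph, "Q", "ANS", intended_hops=3))
```

In this toy graph the intended chain Q, A, C, ANS has three hops, but the path through B reaches the answer in two, so the question would be flagged for revision. A real guidance agent would presumably reason over much richer graph structure; the breadth-first check only illustrates the idea of detecting a shortcut.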