EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract

Search agents (large language models augmented with search tools) have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, and such reasoning shortcuts obscure their true browsing competence. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge by credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in the synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be updated regularly to prevent data contamination and maintain temporal freshness. Extensive experiments confirm that the benchmark is highly difficult and that solving it requires broad horizontal search. This approach establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.
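
To make the roles of the filtering and guidance agents more concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the `Fact` record, the numeric `credibility` and `popularity` scores, the thresholds, the dictionary encoding of a reasoning graph, and the rule that a clue counts as a shortcut when it reaches the answer in fewer than `min_hops` steps. The abstract does not specify how any of these are actually computed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A retrieved piece of knowledge (hypothetical record layout)."""
    text: str
    credibility: float  # assumed score in [0, 1]; higher means more trustworthy
    popularity: float   # assumed score in [0, 1]; higher means more widely known


def filter_facts(facts, min_credibility=0.7, max_popularity=0.3):
    """Keep facts that are credible but obscure.

    Obscure facts are assumed less likely to be memorized by a model,
    which is one way to read "blocking parametric shortcuts".
    The thresholds are illustrative, not taken from the paper.
    """
    return [
        f for f in facts
        if f.credibility >= min_credibility and f.popularity <= max_popularity
    ]


def hops_to_answer(graph, start, answer):
    """Breadth-first search: fewest edges from start to answer, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == answer:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def shortcut_clues(graph, clues, answer, min_hops=2):
    """Return clues that reach the answer in fewer than min_hops steps.

    graph maps each node to the nodes it leads to (an assumed encoding
    of a reasoning graph). A clue flagged here would let a solver skip
    the intended multi-step search.
    """
    flagged = []
    for clue in clues:
        dist = hops_to_answer(graph, clue, answer)
        if dist is not None and dist < min_hops:
            flagged.append(clue)
    return flagged
```

In this sketch, a question whose `shortcut_clues` list is non-empty would be sent back for revision, which loosely mirrors the guidance agent's stated goal of reducing shortcuts. A real system would presumably derive the scores and the graph from model calls and web signals rather than from hand-set numbers.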