BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Abstract

Deep-Research agents, which integrate large language models (LLMs) with search tools, have proven effective at handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks such as BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of Deep-Research methods; and (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, current evaluations may compare complete Deep-Research systems at a given point in time, but they do not support well-controlled experiments that provide insight into the capability of the underlying Deep-Research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of Deep-Research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further raises its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of Deep-Research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
