BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Abstract

Deep research agents, which integrate large language models (LLMs) with search tools, have proven effective at handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks such as BrowseComp rely on black-box live web search APIs and have two notable limitations: (1) fairness: dynamic and opaque web APIs hinder fair comparison and reproducibility of deep research methods; and (2) transparency: lack of control over the document corpus makes it difficult to isolate the retriever's contribution. In other words, current evaluations can compare complete deep research systems at a given point in time, but they do not support well-controlled experiments that give insight into the capability of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus comes with human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective at distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further raises its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in deep research systems.
