EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract

Search agents, large language models augmented with search tools, have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, which leaves them vulnerable to test-set contamination and parametric memorization. As a result, models can achieve high scores through fact recall and reasoning shortcuts rather than genuine retrieval, which obscures their true browsing competence. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that screens the retrieved knowledge by credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in the synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be updated regularly to prevent data contamination and maintain temporal freshness. Extensive experiments confirm that the benchmark is highly difficult and that solving it requires broad horizontal search. EvoBrowseComp thus establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.
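
To make the roles of the filtering and guidance agents more concrete, the following is a minimal illustrative sketch and not the authors' implementation. Everything in it is an assumption introduced for illustration: the `Fact` record, the numeric `credibility` and `popularity` scores, the thresholds, the adjacency-list representation of a reasoning graph, and the definition of a shortcut as a clue that reaches the answer in fewer than `min_hops` edges.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical record for one piece of retrieved knowledge."""
    text: str
    credibility: float  # assumed score in [0, 1]; higher = more trustworthy
    popularity: float   # assumed score in [0, 1]; higher = more likely memorized


def filter_facts(facts, min_credibility=0.7, max_popularity=0.3):
    """Keep facts that are credible but obscure (thresholds are illustrative)."""
    return [
        f for f in facts
        if f.credibility >= min_credibility and f.popularity <= max_popularity
    ]


def shortest_hops(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def has_shortcut(graph, clues, answer, min_hops=3):
    """Flag a question if any clue reaches the answer in fewer than min_hops edges."""
    for clue in clues:
        hops = shortest_hops(graph, clue, answer)
        if hops is not None and hops < min_hops:
            return True
    return False


if __name__ == "__main__":
    facts = [
        Fact("obscure but reliable", 0.9, 0.1),
        Fact("reliable but widely known", 0.9, 0.9),
        Fact("obscure but unreliable", 0.2, 0.1),
    ]
    print([f.text for f in filter_facts(facts)])

    chain = {"clue": ["e1"], "e1": ["e2"], "e2": ["answer"]}
    leaky = {"clue": ["e1", "answer"], "e1": ["e2"], "e2": ["answer"]}
    print(has_shortcut(chain, ["clue"], "answer"))
    print(has_shortcut(leaky, ["clue"], "answer"))
```

In this toy setup, a question whose clue links directly to the answer would be flagged for revision, while a question that forces a multi-hop chain would pass. A real system would presumably derive the scores and the graph from model judgments and live-web evidence, which this sketch does not attempt.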