AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Abstract

Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge on a research problem or to acquire evidence for verifying assumptions and supporting claims. To assess how well AI agents can drive this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting the set of papers that satisfy given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained use of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and also extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. To facilitate future research in this direction, we publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench.
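
As a rough illustration of the two reported metrics, the sketch below computes a set-level IoU for a Wide Research answer and a simple accuracy for Deep Research answers. The abstract does not define either formula, so the function names `set_iou` and `accuracy`, the use of paper identifiers as set elements, and the handling of empty sets are all assumptions, not the benchmark's actual evaluation code.

```python
def set_iou(predicted, relevant):
    """Intersection over union of two collections of paper identifiers.

    Assumption: a Wide Research answer is scored by comparing the set of
    papers the agent returned against the set of qualifying papers.
    """
    pred, gold = set(predicted), set(relevant)
    union = pred | gold
    if not union:
        # Assumption: two empty sets are treated as a perfect match.
        return 1.0
    return len(pred & gold) / len(union)


def accuracy(predictions, targets):
    """Fraction of questions where the returned paper equals the target.

    Assumption: a Deep Research answer is a single paper identifier.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical identifiers, used only to exercise the functions.
wide_score = set_iou(["paper-a", "paper-b", "paper-c"],
                     ["paper-b", "paper-c", "paper-d"])
deep_score = accuracy(["paper-x", "paper-y"], ["paper-x", "paper-z"])
```

Under this definition, returning extra papers lowers the score just as missing papers does, which is one way an open-ended task with an unknown number of qualifying papers can penalize both over-collection and under-collection.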
