AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Abstract

Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge on a research problem or to gather evidence for verifying assumptions and supporting claims. To assess how well AI agents can drive this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting the set of papers that satisfy given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained use of detailed information; and open-ended, involving an unknown number of qualifying papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited to evaluating autonomous research capabilities, and also extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. To facilitate future research in this direction, we publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench.
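
To make the Wide Research metric concrete, the sketch below computes intersection over union (IoU) between a returned set of papers and a reference set. It assumes the standard set-based (Jaccard) definition; the abstract does not spell out the benchmark's exact scoring rules, and the function name `set_iou` and the paper identifiers are hypothetical, not taken from the released pipeline.

```python
def set_iou(predicted, expected):
    """Intersection over union (Jaccard index) of two collections of paper IDs.

    Assumption: standard set IoU; the benchmark's own scoring may differ.
    """
    pred, gold = set(predicted), set(expected)
    union = pred | gold
    if not union:
        # Assumption: two empty sets are treated as a perfect match.
        return 1.0
    return len(pred & gold) / len(union)


# Hypothetical paper identifiers, for illustration only.
returned = ["paper-a", "paper-b", "paper-c"]
reference = ["paper-b", "paper-c", "paper-d", "paper-e"]

# 2 shared papers out of 5 distinct papers overall.
print(set_iou(returned, reference))
```

Under this definition an agent is penalized both for missing qualifying papers and for returning papers that do not qualify, which fits the open-ended setting in which the number of qualifying papers is not known in advance.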

