DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Abstract

Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports, compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Because evaluating DRAs is inherently complex and labor-intensive, we propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria that assesses the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
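To make the two citation measures concrete, the following is a minimal sketch of how an effective citation count and an overall citation accuracy could be computed. It is not the released implementation. The `Citation` record, the `supported` flag (assumed to come from some upstream check of whether the cited page backs the statement), and the choice to report effective citations as a per-task average are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical record: one statement-URL pair extracted from a report."""
    statement: str
    url: str
    supported: bool  # assumed verdict from an upstream support check


def citation_metrics(reports):
    """Return (overall citation accuracy, effective citations per task).

    `reports` is a list with one entry per task; each entry is the list of
    Citation records extracted from that task's report.
    """
    total = sum(len(report) for report in reports)
    supported = sum(c.supported for report in reports for c in report)
    # Accuracy: share of all citations whose source supports the statement.
    accuracy = supported / total if total else 0.0
    # Effective citations: supported citations averaged over tasks (assumed).
    effective_per_task = supported / len(reports) if reports else 0.0
    return accuracy, effective_per_task


# Two toy reports with made-up citations.
reports = [
    [Citation("claim A", "https://example.com/a", True),
     Citation("claim B", "https://example.com/b", False)],
    [Citation("claim C", "https://example.com/c", True),
     Citation("claim D", "https://example.com/d", True)],
]
accuracy, effective_per_task = citation_metrics(reports)
```

Keeping the two numbers separate matters under these assumptions: an agent that cites very little can score high on accuracy while contributing few effective citations, and the reverse can also hold.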