CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Abstract

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging but, more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench span three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system that measures the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both agents using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest difficulty level, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and that could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
