CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
September 17, 2024
Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan
cs.AI
Abstract
AI agents have the potential to aid users on a variety of consequential
tasks, including conducting scientific research. To spur the development of
useful agents, we need benchmarks that are challenging, but more crucially,
directly correspond to real-world tasks of interest. This paper introduces such
a benchmark, designed to measure the accuracy of AI agents in tackling a
crucial yet surprisingly challenging aspect of scientific research:
computational reproducibility. This task, fundamental to the scientific
process, involves reproducing the results of a study using the provided code
and data. We introduce CORE-Bench (Computational Reproducibility Agent
Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers
across three disciplines (computer science, social science, and medicine).
Tasks in CORE-Bench consist of three difficulty levels and include both
language-only and vision-language tasks. We provide an evaluation system to
measure the accuracy of agents in a fast and parallelizable way, saving days of
evaluation time for each run compared to a sequential implementation. We
evaluated two baseline agents: the general-purpose AutoGPT and a task-specific
agent called CORE-Agent. We tested both variants using two underlying language
models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on
the hardest task, showing the vast scope for improvement in automating routine
scientific tasks. Having agents that can reproduce existing work is a necessary
step towards building agents that can conduct novel research and could verify
and improve the performance of other research agents. We hope that CORE-Bench
can improve the state of reproducibility and spur the development of future
research agents.
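The abstract notes that the evaluation system runs tasks in a fast, parallelizable way rather than sequentially, but does not describe how. The sketch below is a rough, hypothetical illustration only, not CORE-Bench's actual harness: it shows one generic way to evaluate benchmark tasks concurrently with Python's standard library. The names `Task`, `run_task`, and `evaluate` are invented for this example.

```python
# Hypothetical sketch of parallel benchmark evaluation.
# None of these names come from the CORE-Bench codebase.
from concurrent.futures import ProcessPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    difficulty: str  # e.g. "easy", "medium", "hard"


def run_task(task: Task) -> bool:
    """Placeholder: run the agent on one task in an isolated environment
    and report whether its answers match the expected results."""
    return False  # replace with a real per-task evaluation


def evaluate(tasks: list[Task], workers: int = 8) -> float:
    """Evaluate all tasks concurrently and return overall accuracy."""
    if not tasks:
        return 0.0
    correct = 0
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for future in as_completed(futures):
            correct += future.result()  # True counts as 1
    return correct / len(tasks)
```

Because each task is independent, running them across worker processes instead of one after another is what can turn days of sequential evaluation into a much shorter run, at the cost of needing isolated per-task environments.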