
EXP-Bench: Can AI Conduct AI Research Experiments?

May 30, 2025
作者: Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen
cs.AI

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark that systematically evaluates AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, EXP-Bench curates 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench reveal only partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic, step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving future AI agents' ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
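
To make the task and scoring structure described above concrete, the following is a minimal, hypothetical Python sketch of what an EXP-Bench-style task record and per-aspect scoring loop might look like. The field names, class names, and aspect breakdown are illustrative assumptions for exposition, not the benchmark's actual schema or evaluation code.

```python
# Hypothetical sketch of an EXP-Bench-style task record and per-aspect scoring.
# Names and fields are illustrative assumptions, not the benchmark's real schema.
from dataclasses import dataclass, field


@dataclass
class ExpBenchTask:
    """One research-experiment task: a question plus incomplete starter code."""
    paper_id: str                  # source publication the task was curated from
    research_question: str         # what the agent must investigate
    starter_code_dir: str          # path to the incomplete starter repository
    reference_procedure: list[str] = field(default_factory=list)  # ground-truth steps


@dataclass
class AgentResult:
    """An agent's outcome on one task, judged along separate aspects."""
    design_correct: bool           # experimental design matches the reference
    implementation_correct: bool   # code changes actually implement that design
    executed_successfully: bool    # the experiment ran end to end
    conclusion_correct: bool       # the analysis reaches the right conclusion


def aggregate(results: list[AgentResult]) -> dict[str, float]:
    """Report per-aspect accuracy and the strict all-or-nothing success rate."""
    n = len(results)
    complete = [
        r for r in results
        if r.design_correct and r.implementation_correct
        and r.executed_successfully and r.conclusion_correct
    ]
    return {
        "design": sum(r.design_correct for r in results) / n,
        "implementation": sum(r.implementation_correct for r in results) / n,
        "execution": sum(r.executed_successfully for r in results) / n,
        "complete_success": len(complete) / n,
    }
```

Under this kind of breakdown, an agent can score moderately on individual aspects while its strict complete_success rate stays near zero, which mirrors the gap between the 20-35% per-aspect scores and the 0.5% end-to-end success rate reported in the abstract.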