

EXP-Bench: Can AI Conduct AI Research Experiments?

May 30, 2025
作者: Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen
cs.AI

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With this pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
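
For illustration, below is a minimal sketch of how one such task and its per-aspect scoring might be represented in Python. The class, field, and function names are assumptions made for exposition only, not the benchmark's actual schema; see the repository linked above for the real task format.

from dataclasses import dataclass, field

# Hypothetical sketch of a single EXP-Bench-style task record. Field names
# are illustrative only; the real schema lives in the open-source repository.
@dataclass
class ResearchTask:
    paper_id: str                # source publication the task is derived from
    research_question: str       # e.g. "Does ablating component X reduce accuracy on Y?"
    starter_code_dir: str        # incomplete starter code the agent must complete
    reference_procedure: list[str] = field(default_factory=list)  # ground-truth step-by-step design
    reference_conclusion: str = ""                                # ground-truth analysis of results

# Illustrative per-aspect scoring, mirroring the abstract's distinction between
# partial credit (design / implementation correctness) and the much stricter
# end-to-end criterion of a complete, executable experiment.
def score_attempt(design_ok: bool, impl_ok: bool, ran_to_completion: bool) -> dict[str, float]:
    return {
        "design": float(design_ok),
        "implementation": float(impl_ok),
        "executable_experiment": float(design_ok and impl_ok and ran_to_completion),
    }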
