EXP-Bench: Can AI Conduct AI Research Experiments?

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. To create such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. With this pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate only partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving the ability of future AI agents to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
