EXP-Bench: Can AI Conduct AI Research Experiments?

Abstract

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexity of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges an AI agent to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. To create such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench show only partial capabilities: while scores on individual experimental aspects, such as design or implementation correctness, occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving the ability of future AI agents to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
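The gap between per-aspect scores and the complete-experiment success rate is easier to see with a small sketch. The Python below is an illustration only, not the EXP-Bench grading code: the record type, the four aspect names, and the toy data are all hypothetical, and it assumes that a task counts as a complete success only when every aspect passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task grading record; field names are illustrative."""
    design_ok: bool
    implementation_ok: bool
    execution_ok: bool
    conclusion_ok: bool


# Assumed aspect list; the real benchmark may grade different aspects.
ASPECTS = ("design_ok", "implementation_ok", "execution_ok", "conclusion_ok")


def aspect_rates(results):
    """Fraction of tasks passing each aspect, computed independently."""
    n = len(results)
    return {a: sum(getattr(r, a) for r in results) / n for a in ASPECTS}


def complete_success_rate(results):
    """Fraction of tasks passing every aspect at once (conjunctive check)."""
    passed_all = sum(all(getattr(r, a) for a in ASPECTS) for r in results)
    return passed_all / len(results)


# Toy data, invented purely for illustration.
results = [
    TaskResult(True, False, False, False),
    TaskResult(False, True, False, False),
    TaskResult(True, True, False, True),
    TaskResult(False, False, True, False),
]

rates = aspect_rates(results)
complete = complete_success_rate(results)
print(rates)
print(complete)
```

In this toy data every aspect is passed by at least one task, yet no single task passes all of them, so the conjunctive rate is zero. Under the stated assumption, the same effect would let individual aspect scores sit well above the rate of fully successful experiments.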

