RExBench: Can coding agents autonomously implement AI research extensions?

Abstract

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
