RExBench: Can coding agents autonomously implement AI research extensions?
June 27, 2025
Authors: Nicholas Edwards, Yukyung Lee, Yujun Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
cs.AI
Abstract
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that implementing research extensions is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all evaluated agents fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents still fall short of handling realistic research extension tasks without substantial human guidance.
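
The abstract describes an execution-based evaluation: agent outputs are run, and the results are checked against expert-defined success criteria. The paper's actual harness is not shown here, but the following minimal Python sketch illustrates what such a check could look like, assuming each task ships with a run script and a file of expected metrics. All file names (run_experiment.sh, results.json, expected.json), the patch-based workflow, and the numeric tolerance are hypothetical illustrations, not the authors' implementation.

# Illustrative sketch only: apply an agent-generated patch to a clean
# copy of the task repository, execute the extension experiment, and
# compare the produced metrics against expert-defined expected values.
import json
import subprocess
from pathlib import Path

TOLERANCE = 1e-3  # assumed tolerance for matching expected metrics

def evaluate_task(task_dir: Path, agent_patch: Path) -> bool:
    # Apply the agent's diff to the task's codebase.
    subprocess.run(
        ["git", "-C", str(task_dir), "apply", str(agent_patch)],
        check=True,
    )
    # Execute the experiment; the harness only inspects its outputs,
    # not the agent's code changes directly.
    subprocess.run(["bash", str(task_dir / "run_experiment.sh")], check=True)
    produced = json.loads((task_dir / "results.json").read_text())
    expected = json.loads((task_dir / "expected.json").read_text())
    # Success criterion: every expected metric is reproduced within
    # tolerance (a missing metric counts as failure via KeyError guard).
    return all(
        k in produced and abs(produced[k] - v) <= TOLERANCE
        for k, v in expected.items()
    )

Because success is judged only by executing outputs and comparing results, a harness of this shape stays agnostic to how the agent wrote its code, which is also what makes the benchmark resistant to solutions memorized from training data.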