
ResearchGym: Evaluating Language Model Agents on Real-World AI Research

February 16, 2026
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
cs.AI

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. The agent improves over the baselines provided in the repository in just 1 of 15 evaluations (6.7%), by a margin of 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
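
The listing includes no code, but to make the evaluation setup concrete, below is a minimal hypothetical sketch in Python of how a task environment's baseline comparison and sub-task completion rate might be tallied. The names TaskEnvironment, SubTask, surpasses_baseline, and completion_rate are illustrative assumptions, not ResearchGym's actual API.

```python
# Hypothetical sketch of a ResearchGym-style task record and scoring pass.
# All class and function names are illustrative assumptions, not the
# benchmark's actual interface.
from dataclasses import dataclass, field


@dataclass
class SubTask:
    name: str
    baseline_score: float       # score of the human baseline preserved in the repo
    higher_is_better: bool = True


@dataclass
class TaskEnvironment:
    paper_id: str                # the ICML/ICLR/ACL paper the environment is built from
    sub_tasks: list[SubTask] = field(default_factory=list)


def surpasses_baseline(sub_task: SubTask, agent_score: float) -> bool:
    """Return True if the agent's score beats the preserved baseline."""
    if sub_task.higher_is_better:
        return agent_score > sub_task.baseline_score
    return agent_score < sub_task.baseline_score


def completion_rate(env: TaskEnvironment, agent_scores: dict[str, float]) -> float:
    """Fraction of sub-tasks for which the agent reported any result."""
    if not env.sub_tasks:
        return 0.0
    done = sum(1 for st in env.sub_tasks if st.name in agent_scores)
    return done / len(env.sub_tasks)


if __name__ == "__main__":
    env = TaskEnvironment(
        paper_id="icml-2025-spotlight-example",
        sub_tasks=[SubTask("task_a", baseline_score=0.71),
                   SubTask("task_b", baseline_score=0.64)],
    )
    scores = {"task_a": 0.79}    # agent produced a result for one sub-task only
    wins = [st.name for st in env.sub_tasks
            if st.name in scores and surpasses_baseline(st, scores[st.name])]
    print(f"completion rate: {completion_rate(env, scores):.1%}, wins: {wins}")
```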