
ResearchGym: Evaluating Language Model Agents on Real-World AI Research

February 16, 2026
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
cs.AI

Abstract

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap: the agent improves over the baselines provided in the repositories in just 1 of 15 evaluations (6.7%), by a margin of 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits imposed by context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
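The headline numbers in the abstract are simple aggregates over per-run outcomes. The following minimal Python sketch is illustrative only: the RunResult record and summarize helper are hypothetical names, not the ResearchGym API. It shows how a win rate over repository baselines and a mean sub-task completion rate could be computed from a set of evaluation runs.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """Outcome of one agent evaluation in a task environment (hypothetical record format)."""
    agent_score: float        # agent's final score on the paper's metric
    baseline_score: float     # strongest baseline provided in the repository
    subtasks_completed: int   # sub-tasks the agent finished in this run
    subtasks_total: int       # sub-tasks defined for this environment

def summarize(runs: list[RunResult]) -> dict:
    """Aggregate two quantities of the kind reported in the abstract:
    the fraction of runs that beat the provided baseline, and the
    mean sub-task completion rate across runs."""
    beats_baseline = [r.agent_score > r.baseline_score for r in runs]
    completion = [r.subtasks_completed / r.subtasks_total for r in runs]
    return {
        "win_rate": sum(beats_baseline) / len(runs),
        "mean_subtask_completion": sum(completion) / len(runs),
    }

if __name__ == "__main__":
    # Toy illustration: 15 runs, only one of which beats its baseline,
    # mirroring the 1/15 (6.7%) win rate described in the abstract.
    runs = [RunResult(0.40, 0.50, 2, 8) for _ in range(14)]
    runs.append(RunResult(0.5575, 0.50, 6, 8))  # the single winning run (+11.5% relative)
    print(summarize(runs))
```

The toy data is made up; only the aggregation logic is meant to convey how such reliability statistics are derived from repeated runs.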