Benchmarking AI Agents for Scientific Challenges Across Scales

Abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning that scientific work requires, whereas benchmarks for scientific tasks often reduce research to static, direct problems and offer limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific domains: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Overall, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets are available at https://sciagentarena.github.io/.