Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
June 10, 2026
Authors: Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao
cs.AI
Abstract
AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Overall, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets are available at https://sciagentarena.github.io/.