ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
May 26, 2025
作者: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
cs.AI
Abstract
Large Language Models (LLMs) have extended their impact beyond Natural
Language Processing, substantially fostering the development of
interdisciplinary research. Recently, various LLM-based agents have been
developed to assist scientific discovery progress across multiple aspects and
domains. Among these, computer-using agents, capable of interacting with
operating systems as humans do, are paving the way to automated scientific
problem-solving and addressing routines in researchers' workflows. Recognizing
the transformative potential of these agents, we introduce ScienceBoard, which
encompasses two complementary contributions: (i) a realistic, multi-domain
environment featuring dynamic and visually rich scientific workflows with
integrated professional software, where agents can autonomously interact via
different interfaces to accelerate complex research tasks and experiments; and
(ii) a challenging benchmark of 169 high-quality, rigorously validated
real-world tasks curated by humans, spanning scientific-discovery workflows in
domains such as biochemistry, astronomy, and geoinformatics. Extensive
evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude
3.7, UI-TARS) show that, despite some promising results, they still fall short
of reliably assisting scientists in complex workflows, achieving only a 15%
overall success rate. In-depth analysis further provides valuable insights into
addressing current agent limitations and into more effective design principles,
paving the way to build more capable agents for scientific discovery. Our code,
environment, and benchmark are at
https://qiushisun.github.io/ScienceBoard-Home/.