ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing and substantially fostered interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery across multiple aspects and domains. Among these, computer-using agents, which can interact with operating systems as humans do, are paving the way toward automated scientific problem-solving and the handling of routine tasks in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which makes two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving an overall success rate of only 15%. In-depth analysis further yields valuable insights for addressing current agent limitations and for formulating more effective design principles, paving the way toward more capable agents for scientific discovery. Our code, environment, and benchmark are available at https://qiushisun.github.io/ScienceBoard-Home/.