ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
May 26, 2025
作者: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
cs.AI
Abstract
Large Language Models (LLMs) have extended their impact beyond Natural
Language Processing, substantially fostering the development of
interdisciplinary research. Recently, various LLM-based agents have been
developed to assist scientific discovery progress across multiple aspects and
domains. Among these, computer-using agents, capable of interacting with
operating systems as humans do, are paving the way to automated scientific
problem-solving and addressing routines in researchers' workflows. Recognizing
the transformative potential of these agents, we introduce ScienceBoard, which
encompasses two complementary contributions: (i) a realistic, multi-domain
environment featuring dynamic and visually rich scientific workflows with
integrated professional software, where agents can autonomously interact via
different interfaces to accelerate complex research tasks and experiments; and
(ii) a challenging benchmark of 169 high-quality, rigorously validated
real-world tasks curated by humans, spanning scientific-discovery workflows in
domains such as biochemistry, astronomy, and geoinformatics. Extensive
evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude
3.7, UI-TARS) show that, despite some promising results, they still fall short
of reliably assisting scientists in complex workflows, achieving only a 15%
overall success rate. In-depth analysis further provides valuable insights into
addressing current agent limitations and into more effective design principles,
paving the way to build more capable agents for scientific discovery. Our code,
environment, and benchmark are at
https://qiushisun.github.io/ScienceBoard-Home/.