ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing and are substantially advancing interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery across multiple aspects and domains. Among these, computer-using agents, which can interact with operating systems as humans do, are paving the way toward automated scientific problem-solving and the handling of routine tasks in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and for deriving more effective design principles, paving the way toward more capable agents for scientific discovery. Our code, environment, and benchmark are available at https://qiushisun.github.io/ScienceBoard-Home/.
