Benchmarking AI Agents for Solving Scientific Tasks at Scale

Abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions to open-ended research questions. We further characterize failure modes common across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Taken together, these results show that SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets are available at https://sciagentarena.github.io/.