AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

October 24, 2025
Authors: Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, Daniel S. Weld
cs.AI

Abstract

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside it, we provide a comprehensive set of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.