AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
February 6, 2026
Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach
cs.AI
Abstract
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.