AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
February 6, 2026
Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach
cs.AI
Abstract
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis, and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
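The abstract emphasizes that the AIRS-Bench task format is meant to make adding new tasks and comparing agent scaffolds straightforward. Purely as an illustrative sketch of what such a task record could look like, the snippet below defines a hypothetical `ResearchTask` dataclass with a human-SOTA comparison helper; the class name, its fields, and the placeholder score are assumptions made for illustration, not the actual AIRS-Bench schema, which is specified in the open-sourced task definitions.

```python
# Hypothetical sketch only: the names and fields below are illustrative
# assumptions, not the real AIRS-Bench API or data.
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResearchTask:
    """A self-contained task: a brief given to the agent (with no baseline
    code), an evaluation metric, and the published human SOTA to beat."""
    name: str
    domain: str            # e.g. "language modeling", "bioinformatics"
    description: str       # research goal handed to the agent
    metric_name: str       # e.g. "perplexity", "MAE"
    higher_is_better: bool
    human_sota: float      # score reported in the source paper

    def beats_human_sota(self, agent_score: float) -> bool:
        """Return True if the agent's final score exceeds the human baseline."""
        if self.higher_is_better:
            return agent_score > self.human_sota
        return agent_score < self.human_sota


# Integrating a new task would amount to registering another instance.
TASKS: Dict[str, ResearchTask] = {
    "toy_forecasting": ResearchTask(
        name="toy_forecasting",
        domain="time series forecasting",
        description="Improve forecast accuracy on the held-out split.",
        metric_name="MAE",
        higher_is_better=False,
        human_sota=0.42,  # placeholder value, not a real AIRS-Bench number
    )
}
```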