MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
May 26, 2025
作者: Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
cs.AI
Abstract
Recent advancements in AI agents have demonstrated their growing potential to
drive and support scientific discovery. In this work, we introduce MLR-Bench, a
comprehensive benchmark for evaluating AI agents on open-ended machine learning
research. MLR-Bench includes three key components: (1) 201 research tasks
sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2)
MLR-Judge, an automated evaluation framework combining LLM-based reviewers with
carefully designed review rubrics to assess research quality; and (3)
MLR-Agent, a modular agent scaffold capable of completing research tasks
through four stages: idea generation, proposal formulation, experimentation,
and paper writing. Our framework supports both stepwise assessment across these
distinct research stages, and end-to-end evaluation of the final research
paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced
coding agent, finding that while LLMs are effective at generating coherent
ideas and well-structured papers, current coding agents frequently (e.g., in
80% of the cases) produce fabricated or invalidated experimental
results--posing a major barrier to scientific reliability. We validate
MLR-Judge through human evaluation, showing high agreement with expert
reviewers, supporting its potential as a scalable tool for research evaluation.
We open-source MLR-Bench to help the community benchmark, diagnose, and improve
AI research agents toward trustworthy and transparent scientific discovery.
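The abstract describes a four-stage agent scaffold (idea generation, proposal formulation, experimentation, paper writing) judged by LLM reviewers against review rubrics. As a rough illustration only, the sketch below shows how such a pipeline might be wired together. The stage prompts, rubric dimensions, and the injectable LLM callable are hypothetical placeholders and do not reflect the released MLR-Bench or MLR-Agent code.

```python
"""Minimal, hypothetical sketch of a four-stage research-agent pipeline with
rubric-based LLM judging, loosely following the stages named in the abstract.
All names, prompts, and rubric dimensions are illustrative assumptions."""

from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Injectable LLM callable: prompt in, text out. Swap in a real client here.
LLMFn = Callable[[str], str]

STAGES: List[str] = ["idea", "proposal", "experiment", "paper"]

# Hypothetical rubric dimensions for the judge, each scored on a 1-10 scale.
RUBRIC: List[str] = ["clarity", "novelty", "soundness", "significance"]


@dataclass
class ResearchRun:
    task: str                                                 # workshop-style research task
    artifacts: Dict[str, str] = field(default_factory=dict)   # stage name -> stage output


def run_pipeline(task: str, llm: LLMFn) -> ResearchRun:
    """Run the four stages in order, feeding each stage the accumulated context."""
    run = ResearchRun(task=task)
    context = f"Research task: {task}"
    for stage in STAGES:
        prompt = f"{context}\n\nProduce the {stage} for this task."
        output = llm(prompt)
        run.artifacts[stage] = output
        context += f"\n\n[{stage}]\n{output}"  # later stages see earlier outputs
    return run


def judge(run: ResearchRun, reviewer: LLMFn) -> Dict[str, int]:
    """Score the final paper on each rubric dimension with an LLM reviewer."""
    scores: Dict[str, int] = {}
    paper = run.artifacts.get("paper", "")
    for dim in RUBRIC:
        prompt = (
            f"Rate the following paper's {dim} on a 1-10 scale. "
            f"Reply with a single integer.\n\n{paper}"
        )
        reply = reviewer(prompt)
        digits = "".join(ch for ch in reply if ch.isdigit())
        scores[dim] = max(1, min(10, int(digits or 1)))  # clamp to the 1-10 range
    return scores


if __name__ == "__main__":
    # Dummy LLM so the sketch runs end to end without any API access.
    dummy: LLMFn = lambda p: "7" if "1-10 scale" in p else f"(stub output for: {p[:40]}...)"
    result = run_pipeline("Efficient fine-tuning under distribution shift", dummy)
    print(judge(result, dummy))
```

The stepwise evaluation mentioned in the abstract would correspond to scoring each entry in `run.artifacts` separately rather than only the final paper; the end-to-end evaluation corresponds to judging the `paper` artifact as shown.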