

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

May 26, 2025
作者: Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
cs.AI

Abstract

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
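The abstract describes a four-stage agent scaffold (idea generation, proposal formulation, experimentation, paper writing) judged by LLM reviewers against review rubrics. The sketch below is a minimal illustration of that pipeline structure, not the authors' implementation; the `call_llm` helper, the dataclass fields, and the rubric dimensions are hypothetical placeholders.

```python
# Minimal sketch (not the MLR-Bench implementation) of a four-stage
# research pipeline and rubric-based LLM judging, as described in the abstract.
# `call_llm` and all function names here are hypothetical placeholders.

from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; swap in a real client."""
    raise NotImplementedError("Plug in an actual LLM backend here.")


@dataclass
class ResearchArtifact:
    task: str                 # workshop-style research task description
    idea: str = ""
    proposal: str = ""
    experiment_report: str = ""
    paper: str = ""


def run_agent(task: str) -> ResearchArtifact:
    """Run the four stages sequentially, each conditioned on prior outputs."""
    art = ResearchArtifact(task=task)
    art.idea = call_llm(f"Propose a research idea for this task:\n{task}")
    art.proposal = call_llm(f"Write a detailed proposal for this idea:\n{art.idea}")
    art.experiment_report = call_llm(
        f"Design and report experiments for this proposal:\n{art.proposal}"
    )
    art.paper = call_llm(
        "Write a workshop-style paper from the following materials:\n"
        f"{art.idea}\n{art.proposal}\n{art.experiment_report}"
    )
    return art


# Example rubric dimensions (illustrative only).
RUBRIC = ["clarity", "novelty", "soundness", "significance"]


def judge_paper(paper: str) -> dict[str, int]:
    """Score a paper on each rubric dimension with an LLM reviewer (1-10)."""
    scores: dict[str, int] = {}
    for criterion in RUBRIC:
        reply = call_llm(
            f"On a 1-10 scale, rate the {criterion} of this paper. "
            f"Answer with a single integer.\n\n{paper}"
        )
        scores[criterion] = int(reply.strip())
    return scores
```

Structuring the scaffold this way makes the stepwise evaluation the abstract mentions straightforward: each intermediate artifact (idea, proposal, experiment report) can be scored on its own, in addition to end-to-end judging of the final paper.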
