FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
October 12, 2025
Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu
cs.AI
Abstract
Large language models (LLMs) have sparked growing interest in automatic
machine learning research agents. Among them, agents capable of autonomously
proposing ideas and conducting machine learning experiments are particularly
promising, as they maximize research automation and accelerate scientific
progress by iteratively refining ideas based on experimental results. However,
comprehensively evaluating such agents remains challenging. Existing benchmarks
tend to overemphasize engineering aspects while neglecting academic rigor,
creating barriers that obscure a clear assessment of an agent's scientific
capabilities in machine learning research. They also suffer from limited task
diversity, an overemphasis on application-oriented tasks over fundamental
research problems, and limited scalability to realistic research settings. To
address these limitations, we introduce FML-bench, a benchmark designed to
evaluate automatic machine learning research agents on 8 diverse and
fundamental machine learning research problems. It reduces coding burden,
emphasizes fundamental problems rather than specific use cases, offers high
task diversity, and is extensible to real-world machine learning GitHub
repositories. Furthermore, we present a unified evaluation framework with five
complementary metrics, designed to comprehensively assess agent performance on
our benchmark. We evaluate state-of-the-art automatic research agents on
FML-bench, and find that agents employing broad research exploration strategies
outperform those focusing on narrow but deep exploration. These findings
suggest that emphasizing the breadth of exploration may lead to more effective
research outcomes than focusing solely on incremental refinement. Our benchmark
is available at https://github.com/qrzou/FML-bench.