FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

October 12, 2025
Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu
cs.AI

Abstract

Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
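The abstract's central finding, that broad exploration across many ideas beats narrow but deep refinement of one idea, can be illustrated with a toy simulation. The sketch below is not FML-bench code: `evaluate_idea`, `broad_exploration`, and `deep_refinement` are hypothetical stand-ins, assuming each research idea can be scored by a single noisy objective with an unknown optimum.

```python
# Minimal, hypothetical sketch of the breadth-vs-depth trade-off described in
# the abstract. None of these names come from FML-bench; evaluate_idea stands
# in for running one research idea end-to-end and returning a benchmark score.
import random

random.seed(0)

def evaluate_idea(idea: float) -> float:
    """Stand-in for one experiment; the score peaks at an unknown optimum."""
    optimum = 0.73
    return max(0.0, 1.0 - abs(idea - optimum)) + random.gauss(0, 0.02)

def broad_exploration(budget: int) -> float:
    """Sample many independent ideas; keep the best (breadth-first)."""
    return max(evaluate_idea(random.random()) for _ in range(budget))

def deep_refinement(budget: int) -> float:
    """Commit to one idea and refine it incrementally (depth-first)."""
    idea = random.random()
    best = evaluate_idea(idea)
    for _ in range(budget - 1):
        candidate = idea + random.gauss(0, 0.05)  # small local tweak
        score = evaluate_idea(candidate)
        if score > best:
            idea, best = candidate, score
    return best

if __name__ == "__main__":
    trials, budget = 100, 16
    broad = sum(broad_exploration(budget) for _ in range(trials)) / trials
    deep = sum(deep_refinement(budget) for _ in range(trials)) / trials
    print(f"broad exploration: {broad:.3f}, deep refinement: {deep:.3f}")
```

Under these assumptions, sampling many independent ideas covers the search space better than incrementally perturbing a single starting point, which mirrors the benchmark's observation that agents emphasizing exploration breadth outperform those focused on incremental refinement.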