FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
October 12, 2025
Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu
cs.AI
Abstract
Large language models (LLMs) have sparked growing interest in automatic
machine learning research agents. Among them, agents capable of autonomously
proposing ideas and conducting machine learning experiments are particularly
promising, as they maximize research automation and accelerate scientific
progress by iteratively refining ideas based on experimental results. However,
comprehensively evaluating such agents remains challenging. Existing benchmarks
tend to overemphasize engineering aspects while neglecting academic rigor,
creating barriers that obscure a clear assessment of an agent's scientific
capabilities in machine learning research. They also suffer from limited task
diversity, an overemphasis on application-oriented tasks over fundamental
research problems, and limited scalability to realistic research settings. To
address these limitations, we introduce FML-bench, a benchmark designed to
evaluate automatic machine learning research agents on 8 diverse and
fundamental machine learning research problems. It reduces coding burden,
emphasizes fundamental problems rather than specific use cases, offers high
task diversity, and is extensible to real-world machine learning GitHub
repositories. Furthermore, we present a unified evaluation framework with five
complementary metrics, designed to comprehensively assess agent performance on
our benchmark. We evaluate state-of-the-art automatic research agents on
FML-bench, and find that agents employing broad research exploration strategies
outperform those focusing on narrow but deep exploration. These findings
suggest that emphasizing the breadth of exploration may lead to more effective
research outcomes than focusing solely on incremental refinement. Our benchmark
is available at https://github.com/qrzou/FML-bench.