MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Abstract

Existing evaluation of large language model (LLM) agents on scientific discovery lacks objective baselines and metrics to assess the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI's MLE-Bench (Chan et al., 2024) and METR's RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work, e.g., AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and the agents' actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow continually with new ML competitions, encouraging rigorous and objective evaluations of AI's research capabilities.
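The "gap closed" figure quoted in the abstract can be read as a normalized score: the agent's improvement over the baseline, divided by the top human participant's improvement over the same baseline. The sketch below illustrates that reading. The abstract does not state the formula, so the linear normalization, the function name `gap_closed_percent`, and the example scores are all assumptions for illustration, not values or code from the paper.

```python
def gap_closed_percent(agent_score: float,
                       baseline_score: float,
                       top_human_score: float) -> float:
    """Percentage of the baseline-to-top-human gap closed by the agent.

    Assumed linear normalization (not taken from the abstract):
        100 * (agent - baseline) / (top_human - baseline)
    0 means the agent matches the baseline, 100 means it matches the
    top human participant, and negative values mean it falls below the baseline.
    """
    gap = top_human_score - baseline_score
    if gap == 0:
        raise ValueError("baseline and top human scores are equal; the gap is undefined")
    return 100.0 * (agent_score - baseline_score) / gap


# Hypothetical scores, for illustration only (not results from the paper).
print(gap_closed_percent(agent_score=52.0, baseline_score=50.0, top_human_score=70.0))  # 10.0
```

Because the numerator and the denominator change sign together, the same expression also covers lower-is-better metrics such as error rates, provided the baseline and top human scores are given in the same units.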
