

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

February 20, 2025
作者: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
cs.AI

Abstract

We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
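To illustrate the Gym-environment framing the abstract describes, here is a minimal sketch of how an agent might interact with a research task through a reset/step loop. All class and method names here are illustrative assumptions for exposition, not MLGym's actual API; the toy "task" stands in for the hyperparameter-tuning behavior the paper observes in frontier models.

```python
# Hypothetical sketch of a Gym-style loop for an AI research task.
# Names (ToyResearchEnv, run_agent) are assumptions, not MLGym's API.

class ToyResearchEnv:
    """Toy task: find the best integer hyperparameter in [0, 10]."""

    def __init__(self, target=3, max_steps=10):
        self.target = target        # unknown-to-the-agent optimum
        self.max_steps = max_steps  # experiment budget
        self.steps = 0

    def reset(self):
        """Start a new episode and return the task description."""
        self.steps = 0
        return "task: find the best hyperparameter in [0, 10]"

    def step(self, action):
        """Run one 'experiment' and return (observation, reward, done)."""
        self.steps += 1
        reward = -abs(action - self.target)  # closer guesses score higher
        done = reward == 0 or self.steps >= self.max_steps
        return f"score for {action}: {reward}", reward, done


def run_agent(env):
    """A trivial agent that sweeps candidates, i.e. hyperparameter search."""
    env.reset()
    best_action, best_reward = None, float("-inf")
    for action in range(11):
        _, reward, done = env.step(action)
        if reward > best_reward:
            best_action, best_reward = action, reward
        if done:
            break
    return best_action, best_reward
```

In the real framework, the observation would be rich (code, data, experiment logs), the action space would be open-ended (editing files, running training jobs), and the reward would come from task-specific evaluation metrics; the reset/step contract sketched above is what makes RL training of such agents possible.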
