AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
May 12, 2026
Authors: Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
cs.AI
Abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configuration, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) a structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
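To make the "learn from cheap, optimize expensive" idea concrete, the sketch below shows a toy multi-fidelity configuration environment and a baseline screen-then-verify policy. All names (`ToyConfigEnv`, `cheap_then_expensive`), the cost table, and the noise model are illustrative assumptions for this example, not the paper's actual LLMConfig-Gym API.

```python
import random

# Relative cost of evaluating a config at each fidelity level
# (an assumed stand-in for "GPU hours"; not from the paper).
FIDELITIES = {"low": 1, "mid": 10, "high": 100}

class ToyConfigEnv:
    """Hypothetical environment: the agent probes candidate configs at a
    chosen fidelity, paying proportional cost, and aims to identify a
    strong high-fidelity config within a fixed compute budget."""

    def __init__(self, n_configs=8, budget=150, seed=0):
        self.rng = random.Random(seed)
        self.budget = budget
        # Hidden "true" quality of each candidate configuration.
        self.quality = {c: self.rng.random() for c in range(n_configs)}

    def step(self, config, fidelity):
        """Evaluate `config` at `fidelity`; return (score, cost, done)."""
        cost = FIDELITIES[fidelity]
        if cost > self.budget:
            return None, 0.0, True  # budget exhausted: episode ends
        self.budget -= cost
        # Lower fidelity gives a cheaper but noisier estimate of quality.
        noise = {"low": 0.3, "mid": 0.1, "high": 0.0}[fidelity]
        score = self.quality[config] + self.rng.uniform(-noise, noise)
        return score, cost, self.budget <= 0

def cheap_then_expensive(env, n_configs=8):
    """Baseline policy: screen every config at low fidelity, then spend
    the remaining budget confirming the best candidate at high fidelity."""
    screened = {c: env.step(c, "low")[0] for c in range(n_configs)}
    best = max(screened, key=screened.get)
    final_score, _, _ = env.step(best, "high")
    return best, final_score
```

A learned agent would replace `cheap_then_expensive` with a policy that decides, state by state, which config to probe next and at what fidelity, which is what the long-horizon MDP formulation in the abstract refers to.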