AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
May 12, 2026
Authors: Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
cs.AI
Abstract
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configuration, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate from them to efficiently identify promising configurations in expensive LLM settings. The core challenge is enabling an agent to learn through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; and 2) a structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
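To make the learn-from-cheap, optimize-expensive idea concrete, here is a minimal, self-contained Python sketch of a multi-fidelity configuration loop. All names (`MultiFidelityConfigEnv`, the fidelity levels, costs, and the toy learning-rate landscape) are illustrative assumptions, not the paper's LLMConfig-Gym API: the agent screens candidate configurations at a cheap, noisy fidelity, shortlists at a medium fidelity, and spends the expensive high-fidelity budget only once.

```python
import random

class MultiFidelityConfigEnv:
    """Toy multi-fidelity environment: cheaper fidelities are noisier.

    Each fidelity maps to (observation noise std, cost in GPU hours).
    """
    FIDELITIES = {"small": (0.30, 1), "medium": (0.10, 10), "large": (0.02, 100)}

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.gpu_hours = 0

    def true_score(self, lr):
        # Hidden configuration landscape: best learning rate near 3e-4.
        return 1.0 - abs(lr - 3e-4) / 3e-4

    def run(self, lr, fidelity):
        noise, cost = self.FIDELITIES[fidelity]
        self.gpu_hours += cost
        return self.true_score(lr) + self.rng.gauss(0, noise)

def cheap_to_expensive(env, candidates):
    # Screen all candidates at low fidelity, keep the top 2,
    # re-rank at medium fidelity, then run the single best at high fidelity.
    screened = {lr: env.run(lr, "small") for lr in candidates}
    shortlist = sorted(candidates, key=screened.get, reverse=True)[:2]
    refined = {lr: env.run(lr, "medium") for lr in shortlist}
    best = max(refined, key=refined.get)
    return best, env.run(best, "large")

env = MultiFidelityConfigEnv(seed=42)
best_lr, final_score = cheap_to_expensive(env, [1e-4, 3e-4, 1e-3, 3e-3])
print(best_lr, env.gpu_hours)
```

Exhaustively evaluating all four candidates at the "large" fidelity would cost 400 GPU hours; the staged strategy above spends 4 + 20 + 100 = 124, illustrating why extrapolating from cheap fidelities matters when high-fidelity runs dominate the budget.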