AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Abstract

Effectively configuring scalable large language model (LLM) experiments, from architecture design to hyperparameter tuning and beyond, is crucial for advancing LLM research: poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no prior work has addressed the automation of high-cost LLM experiment configuration, so the problem remains labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate them to efficiently identify promising configurations in expensive LLM settings. The core challenge is enabling an agent to learn through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) a structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for automating scalable real-world LLM experiments.
