AutoLab: Can Frontier Models Solve Long-Horizon Automated Research and Engineering Tasks?

Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
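To make the closed-loop setting concrete, the following is a minimal sketch of a propose, measure, and keep-if-better loop that runs until a wall-clock budget expires. It is an illustration only and not the AutoLab evaluation harness: the names `optimize_with_budget`, `propose`, `score`, and `clock` are hypothetical, and the toy objective stands in for a real task such as kernel runtime or model accuracy.

```python
import random
import time


def optimize_with_budget(baseline, propose, score, budget_seconds, clock=time.monotonic):
    """Greedy iterative improvement under a wall-clock budget (illustrative sketch).

    baseline:       a correct but suboptimal starting artifact
    propose:        hypothetical callable mapping the current best to a new candidate
    score:          hypothetical callable returning a number, higher is better
    budget_seconds: wall-clock time allowed for the whole loop
    clock:          time source, injectable so the loop can be tested deterministically

    Returns (best_artifact, best_score, iterations_run).
    """
    deadline = clock() + budget_seconds
    best, best_score = baseline, score(baseline)
    iterations = 0
    # Keep iterating until the budget is spent; stopping early wastes budget.
    while clock() < deadline:
        candidate = propose(best)
        candidate_score = score(candidate)  # empirical feedback from a measurement
        iterations += 1
        if candidate_score > best_score:    # keep only measured improvements
            best, best_score = candidate, candidate_score
    return best, best_score, iterations


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy objective (an assumption for the demo): maximise -(x - 3)^2 starting from x = 0.
    best, best_score, iterations = optimize_with_budget(
        baseline=0.0,
        propose=lambda x: x + rng.uniform(-1.0, 1.0),
        score=lambda x: -(x - 3.0) ** 2,
        budget_seconds=0.05,
    )
    print(f"iterations={iterations} best={best:.3f} score={best_score:.5f}")
```

Because the loop only ever replaces the incumbent with a candidate that measured better, the final score can never fall below the baseline, and the number of iterations completed within the budget is what drives improvement in this sketch.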