AutoLab: Can Frontier Models Solve Long-Horizon Autonomous Research and Engineering Tasks?

Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
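
To make the closed-loop setting concrete, the following is a minimal sketch of a propose, benchmark, keep-if-better loop under a wall-clock budget. It is an illustration only and not the AutoLab evaluation harness: the names `closed_loop_optimize`, `propose`, `benchmark`, and `budget_seconds` are hypothetical, and the assumption that lower scores are better (as with runtime) is ours.

```python
import time
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")  # hypothetical artifact type: source file, config, kernel, ...


def closed_loop_optimize(
    baseline: A,
    propose: Callable[[A, float], A],
    benchmark: Callable[[A], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[A, float, int]:
    """Iterate propose -> benchmark -> keep-if-better until the budget expires.

    Assumes lower scores are better. Returns the best artifact found,
    its score, and the number of candidates that were benchmarked.
    """
    deadline = clock() + budget_seconds
    best, best_score = baseline, benchmark(baseline)
    iterations = 0

    # Time awareness: the budget is checked before every new attempt.
    while clock() < deadline:
        candidate = propose(best, best_score)  # edit the current best artifact
        score = benchmark(candidate)           # gather empirical feedback
        iterations += 1
        if score < best_score:                 # keep only measured improvements
            best, best_score = candidate, score

    return best, best_score, iterations
```

In this sketch the final score depends on how many propose and benchmark rounds fit inside the budget rather than on any single proposal, which mirrors the persistence effect described above. A real agent would replace `propose` with model-driven edits, and a realistic harness would also need to stop a benchmark run that overruns the deadline, which this loop does not do.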