AutoLab: Can Frontier Models Solve Long-Horizon Autonomous Research and Engineering Tasks?

Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.