AutoLab: Can Frontier Models Solve Long-Horizon Autonomous Research and Engineering Tasks?

Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
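
The closed-loop setting described in the abstract (start from a correct but suboptimal baseline, then repeatedly propose, measure, and keep improvements until a wall-clock budget expires) can be illustrated with a minimal Python sketch. This is not the AutoLab harness: the function `closed_loop_optimize`, its parameters, and the toy objective are all hypothetical names chosen for illustration, and a simple random-perturbation search stands in for whatever strategy an agent would actually use.

```python
import random
import time


def closed_loop_optimize(baseline, propose, score, budget_seconds,
                         clock=time.monotonic):
    """Hypothetical propose/measure/keep loop under a wall-clock budget.

    baseline:       starting artifact (correct but suboptimal)
    propose:        function mapping the current best artifact to a candidate
    score:          function returning a number; higher is better
    budget_seconds: wall-clock budget for the whole loop
    clock:          time source, injectable so the loop can be tested
    """
    deadline = clock() + budget_seconds
    best, best_score = baseline, score(baseline)
    iterations = 0
    while clock() < deadline:
        candidate = propose(best)            # propose a change
        candidate_score = score(candidate)   # run the experiment, measure it
        if candidate_score > best_score:     # keep only measured improvements
            best, best_score = candidate, candidate_score
        iterations += 1
    return best, best_score, iterations


def toy_score(x):
    """Toy objective (an assumption for this sketch): best value is x = 3."""
    return -(x - 3.0) ** 2


def make_toy_propose(seed=0):
    """Return a proposer that perturbs the current best with Gaussian noise."""
    rng = random.Random(seed)
    return lambda x: x + rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    best, best_score, n = closed_loop_optimize(
        baseline=0.0,
        propose=make_toy_propose(),
        score=toy_score,
        budget_seconds=0.05,
    )
    print(f"{n} iterations, best x = {best:.3f}, score = {best_score:.4f}")
```

Because the loop only ever replaces the current best with a strictly better-scoring candidate, the returned score can never fall below the baseline's. Under this sketch, the only way to improve on the baseline is to keep iterating until the deadline, which mirrors the abstract's point that persistence matters more than the first attempt.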