Agents' Last Exam

Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment in many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks do not provide sustained measurement of performance on real, economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark for evaluating AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with more than 250 industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy of 55 subfields grouped into 13 industry clusters and contains more than 1,000 tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark whose task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.