Agents' Last Exam

Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks do not measure sustained performance on real, economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with more than 250 industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It comprises more than 1,000 tasks organized under a task taxonomy of 55 subfields grouped into 13 industry clusters. Current results show that the hardest tier remains far from saturated: averaged across mainstream harness and backbone configurations, the full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.