Agents' Last Exam

Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks do not measure sustained performance on realistic, economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with more than 250 industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy of 55 subfields, grouped into 13 industry clusters and covering more than 1,000 tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.