Measuring AI Ability to Complete Long Tasks
March 18, 2025
Authors: Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
cs.AI
Abstract
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark
performance remains unclear. To quantify the capabilities of AI systems in
terms of human capabilities, we propose a new metric: the 50%-task-completion
time horizon. This is the time humans typically take to complete tasks that AI
models can complete with a 50% success rate. We first timed humans with relevant
domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter
tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet
have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time
horizon has been doubling approximately every seven months since 2019, though
the trend may have accelerated in 2024. The increase in AI models' time
horizons seems to be primarily driven by greater reliability and ability to
adapt to mistakes, combined with better logical reasoning and tool use
capabilities. We discuss the limitations of our results -- including their
degree of external validity -- and the implications of increased autonomy for
dangerous capabilities. If these results generalize to real-world software
tasks, extrapolation of this trend predicts that within 5 years, AI systems
will be capable of automating many software tasks that currently take humans a
month.
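
The extrapolation in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' method: it assumes a clean exponential trend with a fixed seven-month doubling time, starts from the roughly 50-minute horizon quoted above, and treats a working month as 167 hours (an assumption made here for the example). The function names are hypothetical.

```python
import math

# Assumption for this example: one working month is about 167 hours.
WORK_MONTH_MINUTES = 167 * 60


def extrapolate_horizon(current_minutes, months_ahead, doubling_months=7.0):
    """Project a time horizon forward, assuming steady exponential doubling."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(target_minutes, current_minutes, doubling_months=7.0):
    """Months needed for the horizon to grow from current to target."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Figures taken from the abstract: ~50 minutes today, doubling every ~7 months.
horizon_5y = extrapolate_horizon(50, months_ahead=60)
months_needed = months_until(WORK_MONTH_MINUTES, 50)

print(f"Projected horizon after 5 years: {horizon_5y / 60:.0f} hours")
print(f"Months until a one-work-month horizon: {months_needed:.1f}")
```

Under these assumptions the horizon reaches one working month in about four and a half years, which is consistent with the "within 5 years" claim. The result is sensitive to the doubling time: a faster trend, as the abstract suggests may have held in 2024, would shorten the estimate.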