ClawBench: Can AI Agents Complete Everyday Online Tasks?

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations such as correctly filling in many detailed forms. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves a success rate of only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
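
The abstract does not say how the interception layer is implemented. As a rough illustration only, the idea of forwarding every request except the final submission might be sketched as follows. The `Request` and `SubmissionInterceptor` names, the method-plus-path matching rule, and the example URLs are all hypothetical and are not taken from ClawBench.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Request:
    """Hypothetical stand-in for an outgoing browser request."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Forwards all traffic except the one request that would commit the task.

    Assumption: the final submission can be identified by its HTTP method
    and URL path. A real system may need a richer matching rule.
    """
    final_method: str
    final_path: str
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> str:
        is_final = (
            request.method.upper() == self.final_method.upper()
            and urlparse(request.url).path == self.final_path
        )
        if is_final:
            # Keep the payload for later scoring, but never send it on.
            self.captured.append(request)
            return "blocked"
        return "forwarded"


# Usage: browsing and intermediate steps pass through; only the final
# submission is captured and stopped.
interceptor = SubmissionInterceptor("POST", "/checkout/submit")
results = [
    interceptor.handle(Request("GET", "https://shop.example/cart")),
    interceptor.handle(Request("POST", "https://shop.example/cart/add", "sku=1")),
    interceptor.handle(Request("POST", "https://shop.example/checkout/submit", "order=1")),
]
```

Under these assumptions, the agent interacts with the live site as usual up to the last step, and the captured payload can be inspected to judge whether the task would have been completed correctly.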