ClawBench: Can AI Agents Complete Everyday Online Tasks?
April 9, 2026
作者: Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen
cs.AI
Abstract
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond those covered by existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like correctly filling in many detailed forms. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluation of 7 frontier models shows that both proprietary and open-source models can complete only a small fraction of these tasks; Claude Sonnet 4.6, for example, achieves a success rate of only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
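The interception layer described in the abstract could be sketched as follows. This is a minimal illustrative assumption, not the authors' implementation: the class name, the block-on-mutating-request rule, and the per-task endpoint pattern are all hypothetical, chosen only to show how read traffic can pass through while the one side-effecting submission is captured and blocked for grading.

```python
# Hypothetical sketch of submission interception for safe live-site
# evaluation: all navigation/read traffic is forwarded, while the single
# mutating request matching a per-task "final submission" pattern is
# recorded for grading and blocked. Names and rules are assumptions.
import re
from dataclasses import dataclass, field


@dataclass
class InterceptionLayer:
    # Regex for the endpoint whose request would cause a real-world side
    # effect, e.g. r"/checkout/confirm$" for a purchase task (assumed).
    submit_pattern: str
    captured: list = field(default_factory=list)

    def handle(self, method: str, url: str, body: bytes) -> bool:
        """Return True if the request may proceed to the live site.

        Mutating requests that match the submit pattern are captured
        for later grading and blocked instead of being forwarded.
        """
        if method in ("POST", "PUT") and re.search(self.submit_pattern, url):
            self.captured.append((method, url, body))
            return False  # blocked: no real-world side effect occurs
        return True  # reads and navigation pass through unchanged


layer = InterceptionLayer(submit_pattern=r"/checkout/confirm$")
# Browsing the cart is forwarded; confirming the purchase is intercepted.
assert layer.handle("GET", "https://shop.example/cart", b"") is True
assert layer.handle("POST", "https://shop.example/checkout/confirm", b"qty=1") is False
assert layer.captured == [("POST", "https://shop.example/checkout/confirm", b"qty=1")]
```

In practice such a rule could be attached to a browser automation proxy so the agent interacts with the real site end to end, and the grader inspects the captured payload rather than the site's post-submission state.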