ClawBench: Can AI Agents Complete Everyday Online Tasks?

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond those tested by existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 completes only 33.3% of them. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
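The interception idea described in the abstract can be illustrated with a minimal sketch. This is not the ClawBench implementation: the `Request` and `SubmissionInterceptor` classes, the `checkout_rule` predicate, and the example URLs are all hypothetical, and the abstract does not say how the final submission request is actually identified. The sketch only shows the general pattern of letting ordinary traffic through while capturing and aborting one designated request.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Request:
    """Hypothetical stand-in for an outgoing browser request."""
    method: str
    url: str
    body: Optional[str] = None


@dataclass
class SubmissionInterceptor:
    """Blocks only requests that match a task's final-submission rule.

    `is_final_submission` is an assumed per-task predicate; the real
    matching logic is not described in the abstract.
    """
    is_final_submission: Callable[[Request], bool]
    captured: List[Request] = field(default_factory=list)

    def handle(self, request: Request) -> str:
        # Return "abort" for the final submission, "continue" otherwise.
        if self.is_final_submission(request):
            # Keep the payload so it can be inspected later (assumed use).
            self.captured.append(request)
            return "abort"
        return "continue"


def checkout_rule(request: Request) -> bool:
    # Hypothetical rule: a POST to an order-placement endpoint.
    return request.method == "POST" and request.url.endswith("/orders")


interceptor = SubmissionInterceptor(checkout_rule)
decisions = [
    interceptor.handle(Request("GET", "https://shop.example/cart")),
    interceptor.handle(Request("POST", "https://shop.example/cart/items", "sku=1")),
    interceptor.handle(Request("POST", "https://shop.example/orders", "sku=1&qty=2")),
]
```

In this toy run, browsing and add-to-cart requests pass through unchanged, and only the order-placing POST is captured and aborted, so the site never receives the side-effecting request.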

