ClawBench: Can AI Agents Complete Everyday Online Tasks?

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities that existing benchmarks do not test, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluation of 7 frontier models shows that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves a success rate of only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
