ClawBench: Can AI Agents Complete Everyday Online Tasks?

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond those tested by existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like correctly filling in many detailed forms. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
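
The interception idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how ClawBench implements its interception layer, so everything below is an assumption made for illustration: the `Request` and `SubmissionInterceptor` names, the rule that identifies the final submission by HTTP method plus a URL glob, and the `shop.example` URLs are all hypothetical. The sketch only shows the general pattern of letting ordinary traffic through while capturing and blocking the single request that would cause a real-world side effect.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Request:
    """A simplified stand-in for an outgoing HTTP request (hypothetical type)."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Blocks only requests matching a per-task 'final submission' rule.

    Assumption: the final submission is recognisable by its HTTP method
    and a glob pattern over its URL. A real system may need richer rules.
    """
    final_method: str
    final_url_pattern: str
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> bool:
        """Return True if the request may proceed, False if it was blocked."""
        is_final = (
            request.method.upper() == self.final_method.upper()
            and fnmatchcase(request.url, self.final_url_pattern)
        )
        if is_final:
            # Keep the payload so it can be inspected later, but never send it.
            self.captured.append(request)
            return False
        return True


# Hypothetical traffic from one shopping task.
interceptor = SubmissionInterceptor("POST", "https://shop.example/checkout/submit*")
traffic = [
    Request("GET", "https://shop.example/cart"),
    Request("POST", "https://shop.example/cart/add", "sku=123"),
    Request("POST", "https://shop.example/checkout/submit", "order_id=42"),
]
allowed = [interceptor.handle(r) for r in traffic]
```

In this sketch the intermediate requests, including the non-final `POST` that adds an item to the cart, pass through untouched, so the agent still interacts with the live site. Only the last request is withheld and stored, which is what would let an evaluator check the submitted payload without placing a real order.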