WebGames: Challenging General-Purpose Web-Browsing AI Agents

Abstract

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. The challenges are crafted to be straightforward for humans while systematically testing the limits of current AI systems across five areas: fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models, including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL, against human performance. The results reveal a substantial capability gap: the best AI system achieves a success rate of only 43.1%, compared with 95.7% for humans, highlighting fundamental limitations in the ability of current AI systems to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai and offers a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.
