WebGames: Challenging General-Purpose Web-Browsing AI Agents

Abstract

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models, including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL, against human performance. Results reveal a substantial capability gap: the best AI system achieves a success rate of only 43.1%, compared with 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai and offers a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.