Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Abstract

CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM (MLLM) agents have demonstrated impressive performance on static perception tasks, their ability to handle interactive, multi-step reasoning challenges such as CAPTCHAs remains largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, each annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores (93.3%), state-of-the-art MLLM agents struggle significantly: the highest success rate, achieved by Browser-Use Openai-o3, is only 40.0%, far below human-level performance. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and data are available at this https URL.
