Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
May 30, 2025
Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
cs.AI
Abstract
CAPTCHAs have been a critical bottleneck for deploying web agents in
real-world applications, often blocking them from completing end-to-end
automation tasks. While modern multimodal LLM agents have demonstrated
impressive performance in static perception tasks, their ability to handle
interactive, multi-step reasoning challenges like CAPTCHAs is largely untested.
To address this gap, we introduce Open CaptchaWorld, the first web-based
benchmark and platform specifically designed to evaluate the visual reasoning
and interaction capabilities of MLLM-powered agents through diverse and dynamic
CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225
CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth,
which quantifies the number of cognitive and motor steps required to solve each
puzzle. Experimental results show that while humans consistently achieve
near-perfect scores, state-of-the-art MLLM agents struggle significantly: the
best success rate, achieved by Browser-Use Openai-o3, is only 40.0%, far below
the human-level performance of 93.3%. This highlights Open CaptchaWorld as a
vital benchmark for diagnosing
the limits of current multimodal agents and guiding the development of more
robust multimodal reasoning systems. Code and Data are available at this https
URL.
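
The two quantities the abstract reports, success rate and CAPTCHA Reasoning Depth, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `PuzzleResult` record, its field names, the type labels, and the sample values are hypothetical and do not come from the benchmark's code or data. The sketch only assumes that each puzzle carries an annotated step count and a solved/unsolved outcome.

```python
from dataclasses import dataclass


@dataclass
class PuzzleResult:
    """One agent attempt on one puzzle (hypothetical record layout)."""
    captcha_type: str     # illustrative label, not a real benchmark type name
    reasoning_depth: int  # annotated count of cognitive and motor steps
    solved: bool          # whether the agent passed the puzzle


def success_rate(results):
    """Return the percentage of puzzles solved (0.0 for an empty list)."""
    if not results:
        return 0.0
    return 100.0 * sum(r.solved for r in results) / len(results)


def mean_reasoning_depth(results):
    """Return the average annotated step count (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r.reasoning_depth for r in results) / len(results)


# Made-up sample records, for illustration only.
results = [
    PuzzleResult("type_a", 2, True),
    PuzzleResult("type_a", 3, False),
    PuzzleResult("type_b", 5, False),
    PuzzleResult("type_c", 6, False),
]

print(success_rate(results))          # 25.0
print(mean_reasoning_depth(results))  # 4.0
```

Keeping the depth annotation next to each outcome would also allow success rate to be broken down by depth, which is one plausible way to use such a metric for diagnosis.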