Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
May 30, 2025
Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
cs.AI
Abstract
CAPTCHAs have been a critical bottleneck for deploying web agents in
real-world applications, often blocking them from completing end-to-end
automation tasks. While modern multimodal LLM agents have demonstrated
impressive performance in static perception tasks, their ability to handle
interactive, multi-step reasoning challenges like CAPTCHAs is largely untested.
To address this gap, we introduce Open CaptchaWorld, the first web-based
benchmark and platform specifically designed to evaluate the visual reasoning
and interaction capabilities of MLLM-powered agents through diverse and dynamic
CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225
CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth,
which quantifies the number of cognitive and motor steps required to solve each
puzzle. Experimental results show that while humans consistently achieve
near-perfect scores, state-of-the-art MLLM agents struggle significantly, with
success rates of at most 40.0% (achieved by Browser-Use Openai-o3), far below
the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing
the limits of current multimodal agents and guiding the development of more
robust multimodal reasoning systems. Code and data are available at this https
URL.