AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
April 27, 2026
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
cs.AI
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Comprising 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
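To make the abstract's pipeline description concrete, the sketch below shows one plausible way to represent recursively parsed functional regions and the benchmark tasks derived from them. This is a minimal illustrative sketch, not the actual AutoGUI-v2 data schema: all class, field, and task-type names are assumptions inferred from the abstract's mention of hierarchical functional regions and the three task families (functionality captioning, grounding, and state prediction).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch; field and task-type names are illustrative assumptions,
# not the published AutoGUI-v2 format.

@dataclass
class FunctionalRegion:
    """A node in the recursively parsed hierarchy of a screenshot."""
    bbox: Tuple[int, int, int, int]                 # (x1, y1, x2, y2) in pixels
    description: str                                # functionality summary of the region
    children: List["FunctionalRegion"] = field(default_factory=list)

@dataclass
class BenchmarkTask:
    """One evaluation item tying a screenshot region to a question type."""
    platform: str                                   # one of the six operating systems
    screenshot_path: str
    task_type: str                                  # "captioning" | "grounding" | "state_prediction"
    target_region: FunctionalRegion
    question: str                                   # natural-language query posed to the VLM
    reference_answer: str                           # ground truth from the VLM-human pipeline

def flatten(region: FunctionalRegion) -> List[FunctionalRegion]:
    """Depth-first traversal of the region hierarchy, e.g. to enumerate
    candidate targets when generating element-level tasks."""
    out = [region]
    for child in region.children:
        out.extend(flatten(child))
    return out
```

Under this assumed representation, region-level tasks would target interior nodes of the hierarchy while element-level tasks target leaves, and state-prediction tasks would pair a target region with a described action and ask the model to predict the resulting interface state.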