AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
April 27, 2026
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
cs.AI
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated, focusing either on black-box task completion or on static, shallow grounding, and thereby fail to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Spanning 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
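To make the construction idea concrete, the sketch below illustrates one plausible shape of a recursive region-parsing loop like the one the abstract describes: a VLM call proposes functional sub-regions of a screenshot until regions are atomic, and every node of the resulting hierarchy is turned into captioning, grounding, and state-prediction items. All names here (`Region`, `query_vlm`, `parse_region`, `emit_tasks`) and the task templates are hypothetical illustrations, not the authors' actual pipeline.

```python
"""Minimal sketch of a recursive screenshot-to-task pipeline.
Assumptions: query_vlm stands in for a real VLM API; task schemas
are invented for illustration only."""

from dataclasses import dataclass, field


@dataclass
class Region:
    """A node in the hypothetical hierarchy of functional regions."""
    bbox: tuple[int, int, int, int]           # (x0, y0, x1, y1) in pixels
    description: str                          # VLM-written functional summary
    children: list["Region"] = field(default_factory=list)


def query_vlm(screenshot: bytes, bbox) -> list[tuple[tuple, str]]:
    """Placeholder for a VLM call that proposes functional sub-regions of
    `bbox` as (sub_bbox, functional_description) pairs. An empty list
    marks the region as atomic (a single UI element)."""
    raise NotImplementedError("swap in a real VLM API here")


def parse_region(screenshot: bytes, bbox, description: str,
                 depth: int = 0, max_depth: int = 4) -> Region:
    """Recursively split a screenshot region into functional sub-regions,
    mirroring the hierarchical parsing step described in the abstract."""
    node = Region(bbox=bbox, description=description)
    if depth >= max_depth:
        return node
    for sub_bbox, sub_desc in query_vlm(screenshot, bbox):
        node.children.append(
            parse_region(screenshot, sub_bbox, sub_desc, depth + 1, max_depth)
        )
    return node


def emit_tasks(node: Region) -> list[dict]:
    """Turn each node into evaluation items along the three axes the
    benchmark tests; in a VLM-human collaborative setup, annotators
    would then verify and filter these candidates."""
    tasks = [
        {"type": "functionality_captioning", "bbox": node.bbox,
         "answer": node.description},
        {"type": "functional_grounding", "query": node.description,
         "answer": node.bbox},
        {"type": "state_prediction", "bbox": node.bbox,
         "question": "What does the interface look like after interacting here?"},
    ]
    for child in node.children:
        tasks.extend(emit_tasks(child))
    return tasks
```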