AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
April 27, 2026
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
cs.AI
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Comprising 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
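To make the abstract's pipeline description concrete, the sketch below shows one plausible way to represent recursively parsed functional regions and the benchmark tasks derived from them. This is a minimal illustrative sketch, not the actual AutoGUI-v2 data schema: all class, field, and task-type names are assumptions inferred from the abstract's mention of hierarchical functional regions and the three task families (functionality captioning, grounding, and state prediction).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch; field and task-type names are illustrative assumptions,
# not the published AutoGUI-v2 format.

@dataclass
class FunctionalRegion:
    """A node in the recursively parsed hierarchy of a screenshot."""
    bbox: Tuple[int, int, int, int]                 # (x1, y1, x2, y2) in pixels
    description: str                                # functionality summary of the region
    children: List["FunctionalRegion"] = field(default_factory=list)

@dataclass
class BenchmarkTask:
    """One evaluation item tying a screenshot region to a question type."""
    platform: str                                   # one of the six operating systems
    screenshot_path: str
    task_type: str                                  # "captioning" | "grounding" | "state_prediction"
    target_region: FunctionalRegion
    question: str                                   # natural-language query posed to the VLM
    reference_answer: str                           # ground truth from the VLM-human pipeline

def flatten(region: FunctionalRegion) -> List[FunctionalRegion]:
    """Depth-first traversal of the region hierarchy, e.g. to enumerate
    candidate targets when generating element-level tasks."""
    out = [region]
    for child in region.children:
        out.extend(flatten(child))
    return out
```

Under this assumed representation, region-level tasks would target interior nodes of the hierarchy while element-level tasks target leaves, and state-prediction tasks would pair a target region with a described action and ask the model to predict the resulting interface state.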