

VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

December 18, 2025
Authors: Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen
cs.AI

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD makes the following contributions: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data; (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks; and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though these models exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
PDF · December 20, 2025