VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

December 18, 2025
Authors: Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen
cs.AI

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms and enables hierarchical evaluation for real-world applications. VenusBench-GD makes three contributions: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data; (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks; and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experiments yield two key findings. On basic grounding tasks, general-purpose multimodal models now match or even surpass specialized GUI models. On advanced tasks, GUI-specialized models still hold the advantage, even though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.