MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
July 25, 2025
Authors: Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
cs.AI
Abstract
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI
automation agents across Windows, macOS, Linux, iOS, Android, and Web
platforms. It comprises four levels: GUI Content Understanding, Element
Grounding, Task Automation, and Task Collaboration, covering essential skills
for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA)
metric to assess GUI agent execution efficiency in online automation scenarios.
Through MMBench-GUI, we identify accurate visual grounding as a critical
determinant of overall task success, emphasizing the substantial benefits of
modular frameworks that integrate specialized grounding modules. Furthermore,
to achieve reliable GUI automation, an agent requires strong task planning and
cross-platform generalization abilities, with long-context memory, a broad
action space, and long-term reasoning playing a critical role. More importantly,
task efficiency remains a critically underexplored dimension, and all models
suffer from substantial inefficiencies, with excessive redundant steps even
when tasks are ultimately completed. The integration of precise localization,
effective planning, and early stopping strategies is indispensable for enabling
truly efficient and scalable GUI automation. Our benchmark code, evaluation
data, and running environment will be publicly available at
https://github.com/open-compass/MMBench-GUI.