MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering the essential skills of GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess the execution efficiency of GUI agents in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, which underscores the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, reliable GUI automation requires strong task planning and cross-platform generalization, with long-context memory, a broad action space, and long-term reasoning playing critical roles. More importantly, task efficiency remains a critically underexplored dimension: all models suffer from substantial inefficiencies, taking excessive redundant steps even when tasks are ultimately completed. Integrating precise localization, effective planning, and early stopping strategies is therefore indispensable for truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.

