MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
July 25, 2025
Authors: Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
cs.AI
Abstract
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI
automation agents across Windows, macOS, Linux, iOS, Android, and Web
platforms. It comprises four levels: GUI Content Understanding, Element
Grounding, Task Automation, and Task Collaboration, covering essential skills
for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA)
metric to assess GUI agent execution efficiency in online automation scenarios.
Through MMBench-GUI, we identify accurate visual grounding as a critical
determinant of overall task success, emphasizing the substantial benefits of
modular frameworks that integrate specialized grounding modules. Furthermore,
to achieve reliable GUI automation, an agent requires strong task planning and
cross-platform generalization abilities, with long-context memory, a broad
action space, and long-term reasoning playing critical roles. More importantly,
task efficiency remains a critically underexplored dimension, and all models
suffer from substantial inefficiencies, with excessive redundant steps even
when tasks are ultimately completed. The integration of precise localization,
effective planning, and early stopping strategies is indispensable for
truly efficient and scalable GUI automation. Our benchmark code, evaluation
data, and running environment will be publicly available at
https://github.com/open-compass/MMBench-GUI.