MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, which together cover the essential skills of GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess the execution efficiency of GUI agents in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, which underscores the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing critical roles. More importantly, task efficiency remains a critically underexplored dimension: all models suffer from substantial inefficiencies, taking excessive redundant steps even when tasks are ultimately completed. Integrating precise localization, effective planning, and early stopping strategies is indispensable for truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
