MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

September 8, 2025
Authors: Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu
cs.AI

Abstract

To improve the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents, with a specific focus on the mobile domain. Beyond testing the use of predefined shortcuts, MAS-Bench assesses an agent's ability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advances in creating more efficient and robust intelligent agents.
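
To make the hybrid paradigm concrete, the following is a minimal sketch of how an agent might choose between a predefined shortcut and a GUI-only plan. It is not MAS-Bench's code or API: the names `Shortcut`, `Plan`, `plan_task`, the keyword-matching rule, and the step counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shortcut:
    """Hypothetical knowledge-base entry (not the MAS-Bench schema)."""
    name: str
    kind: str       # e.g. "api", "deep_link", or "rpa_script"
    keyword: str    # naive trigger: the shortcut applies if this appears in the task
    cost_steps: int  # assumed number of agent steps needed when using the shortcut


@dataclass
class Plan:
    mode: str                # "shortcut" or "gui"
    shortcut: Optional[str]  # name of the chosen shortcut, if any
    steps: int               # estimated number of steps for the plan


def plan_task(task: str, knowledge_base: List[Shortcut], gui_steps: int) -> Plan:
    """Prefer the cheapest applicable shortcut; otherwise fall back to GUI-only."""
    candidates = [s for s in knowledge_base if s.keyword in task.lower()]
    if candidates:
        best = min(candidates, key=lambda s: s.cost_steps)
        # Only use the shortcut if it is actually cheaper than the GUI path.
        if best.cost_steps < gui_steps:
            return Plan(mode="shortcut", shortcut=best.name, steps=best.cost_steps)
    # Every task stays solvable through GUI operations alone.
    return Plan(mode="gui", shortcut=None, steps=gui_steps)


# Illustrative knowledge base with made-up entries.
demo_kb = [
    Shortcut("open_wifi_settings", "deep_link", "wifi settings", 1),
    Shortcut("create_calendar_event", "api", "calendar event", 2),
]

if __name__ == "__main__":
    print(plan_task("Open WiFi settings and turn it on", demo_kb, gui_steps=6))
    print(plan_task("Send a photo to a contact", demo_kb, gui_steps=8))
```

The fallback branch mirrors the abstract's statement that tasks remain solvable via GUI-only operations; a real agent would replace the keyword match with learned or model-driven shortcut selection.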