MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

September 8, 2025
Authors: Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu
cs.AI

Abstract

To improve the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents, with a specific focus on the mobile domain. Beyond the use of predefined shortcuts, MAS-Bench assesses an agent's ability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable through GUI-only operations but can be completed significantly faster by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future work on more efficient and robust intelligent agents.