MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Abstract

To improve the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, systematic benchmarking of these hybrid agents remains underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond the use of predefined shortcuts, MAS-Bench assesses an agent's ability to generate shortcuts autonomously by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable through GUI-only operations, but can be completed significantly faster by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap and provides a foundational platform for future work on more efficient and robust intelligent agents.
