MAS-Bench: A Unified Benchmark for Evaluating Shortcut-Augmented Hybrid Mobile GUI Agents

Abstract

To enhance the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond testing the use of predefined shortcuts, MAS-Bench assesses an agent's ability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable through GUI-only operations, but can be completed significantly faster by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future work on more efficient and robust intelligent agents.