MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Abstract

To improve the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents, with a specific focus on the mobile domain. Beyond the use of predefined shortcuts, MAS-Bench assesses an agent's ability to generate shortcuts autonomously by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable through GUI-only operations, but they can be completed significantly faster by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap and provides a foundational platform for future work on more efficient and robust intelligent agents.
