MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Abstract

To improve the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents, with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's ability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap and provides a foundational platform for future work on more efficient and robust intelligent agents.
