MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: memory-related tasks account for only 5.2-11.8% of their tasks, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions are: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, of which 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated evaluation pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) a research-question-driven (RQ-driven) assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and yield 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
