

MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

February 3, 2026
Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
cs.AI

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated evaluation pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) an RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
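The abstract reports results under a pass@k metric. As a point of reference, the standard unbiased pass@k estimator (from the code-generation evaluation literature; the paper's exact formulation may differ) can be sketched as follows, where n is the number of attempts per task and c the number that succeeded:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes, succeeds.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts on a task, 3 of which succeeded.
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
```

A benchmark score is then typically the mean of this quantity over all tasks; averaging the estimator rather than naively subsampling attempts keeps the score's variance low.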