MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
February 3, 2026
Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
cs.AI
Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2%-11.8% of their tasks are memory-related, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) research-question-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
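The abstract does not spell out how pass@k is computed; a common choice is the unbiased estimator popularized for code-generation evaluation, which estimates the probability that at least one of k samples drawn from n independent attempts succeeds. The sketch below assumes that standard formulation (the function name and its use here are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total number of attempts per task
    c: number of successful attempts among the n
    k: sample budget being evaluated

    Returns the probability that at least one of k attempts
    sampled (without replacement) from the n is a success:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if k > n:
        raise ValueError("k cannot exceed the number of attempts n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level pass@k score would then average this quantity over all 128 tasks, with n attempts recorded per task.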