

MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

February 3, 2026
Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
cs.AI

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated evaluation pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) an RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
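The abstract reports results under a pass@k metric. As a point of reference, the standard unbiased pass@k estimator (from the code-generation evaluation literature; the paper's exact formulation may differ) can be sketched as follows, where n is the number of attempts per task and c the number that succeeded:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes, succeeds.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts on a task, 3 of which succeeded.
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
```

A benchmark score is then typically the mean of this quantity over all tasks; averaging the estimator rather than naively subsampling attempts keeps the score's variance low.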