MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
February 3, 2026
Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
cs.AI
Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2%-11.8% of their tasks are memory-related, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) research-question-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
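The abstract does not spell out how pass@k is computed; a common choice is the unbiased estimator popularized for code-generation evaluation, which estimates the probability that at least one of k samples drawn from n independent attempts succeeds. The sketch below assumes that standard formulation (the function name and its use here are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total number of attempts per task
    c: number of successful attempts among the n
    k: sample budget being evaluated

    Returns the probability that at least one of k attempts
    sampled (without replacement) from the n is a success:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if k > n:
        raise ValueError("k cannot exceed the number of attempts n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level pass@k score would then average this quantity over all 128 tasks, with n attempts recorded per task.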