MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Abstract

Current benchmarks for mobile GUI agents systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluates cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, of which 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) a research-question (RQ) driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
