MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluates cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, of which 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated evaluation pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) a research-question-driven (RQ-driven) assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, the benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
