SimuWoB: Fast and Faithful GUI Agent Benchmarking via Simulation of Real Mobile Apps

Abstract

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility, but because rewards are difficult to construct on real applications, they are often limited to open-source apps or file-operation tasks, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents comprising 120 challenging tasks that span diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison with evaluation results on sampled real-world tasks demonstrates that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.