SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Abstract

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because rewards are difficult to construct on real applications, which leaves a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison with evaluation results on sampled real-world tasks indicates that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.