SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Abstract

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility, but because rewards are difficult to construct for real applications, they are often limited to open-source apps or file-operation tasks, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents comprising 120 challenging tasks that span diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison with evaluation results on real-world sample tasks shows that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.