SimuWoB: Real-World Mobile App Simulation for Fast and High-Fidelity GUI Agent Benchmarking

Abstract

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility, but because rewards are difficult to construct on real applications, they are often limited to open-source apps or file-operation tasks, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results against sampled real-world tasks shows that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.