MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Abstract

We present MobileGym, a lightweight, fully controllable, browser-hosted environment for everyday mobile use that targets interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals, through deterministic state-based judging over structured JSON state, and scalable online RL, through low-cost parallel rollouts. The full environment state can be captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, each requiring about 400 MB of memory and about 3 s to cold-start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates (256 test and 160 train) over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct improves by 12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
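To make the idea of state-based judging concrete, the following is a minimal Python sketch, not MobileGym's actual API. It assumes the environment state is a nested JSON-like dictionary and that a task is defined by a list of predicates over that state. The names `fork`, `get_path`, `same_state`, and `judge`, the state layout, and the choice of "fraction of checks passed" as the dense reward are all hypothetical and chosen only for illustration.

```python
import copy
import json
from typing import Any, Callable

# Assumption: the full environment state is a nested, JSON-serializable dict.
State = dict[str, Any]
# A check pairs a human-readable description with a predicate over the state.
Check = tuple[str, Callable[[State], bool]]


def fork(state: State) -> State:
    """Return an independent copy, so a rollout cannot mutate its parent."""
    return copy.deepcopy(state)


def same_state(a: State, b: State) -> bool:
    """Compare two states by their canonical JSON serialization."""
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)


def get_path(state: State, path: str) -> Any:
    """Look up a dotted path such as 'apps.clock.alarms' in the state."""
    node: Any = state
    for key in path.split("."):
        node = node[key]
    return node


def judge(state: State, checks: list[Check]) -> tuple[bool, float]:
    """Return (verdict, reward) for a final state.

    The verdict is True only if every check passes. The reward is the
    fraction of checks passed (an illustrative choice of dense reward).
    Both depend only on the state, so the same state always yields the
    same result.
    """
    results = [predicate(state) for _, predicate in checks]
    return all(results), sum(results) / len(results)


def has_alarm(state: State, time: str, require_enabled: bool) -> bool:
    """True if an alarm at `time` exists (and is enabled, if required)."""
    alarms = get_path(state, "apps.clock.alarms")
    return any(
        a["time"] == time and (a["enabled"] or not require_enabled)
        for a in alarms
    )


# Hypothetical task: "set an alarm for 07:30 and enable it".
initial: State = {
    "apps": {"clock": {"alarms": [{"time": "06:00", "enabled": True}]}}
}
checks: list[Check] = [
    ("an alarm at 07:30 exists", lambda s: has_alarm(s, "07:30", False)),
    ("the 07:30 alarm is enabled", lambda s: has_alarm(s, "07:30", True)),
]

# Simulate a partially successful rollout on a forked copy of the state.
partial = fork(initial)
partial["apps"]["clock"]["alarms"].append({"time": "07:30", "enabled": False})

# Simulate a fully successful rollout on a further fork.
done = fork(partial)
done["apps"]["clock"]["alarms"][-1]["enabled"] = True

print(judge(initial, checks))  # no progress
print(judge(partial, checks))  # alarm created but not enabled
print(judge(done, checks))     # task complete
```

Because the verdict and the reward come from the same predicates, a single judge can serve both evaluation and training in this sketch; forking keeps parallel rollouts from interfering with one another's state.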