MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
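To make the idea of deterministic state-based judging concrete, the following is a minimal sketch, not MobileGym's actual API. It assumes a hypothetical JSON state layout (`alarm.alarms`, `settings.wifi`) and a hypothetical `judge` helper that checks expected values at dotted paths; the pass/fail verdict and a dense reward (the fraction of satisfied checks) come from the same function. Forking is approximated here with a deep copy of the state.

```python
import copy

def get_path(state, path):
    """Follow a dotted path through nested dicts and lists; None if missing."""
    node = state
    for key in path.split("."):
        if isinstance(node, list):
            if not key.isdigit() or int(key) >= len(node):
                return None
            node = node[int(key)]
        elif isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return None
    return node

def judge(state, checks):
    """Return (verdict, reward): verdict is True only if every check holds;
    reward is the fraction of checks satisfied (a simple dense signal)."""
    hits = sum(1 for path, expected in checks if get_path(state, path) == expected)
    return hits == len(checks), hits / len(checks)

# Hypothetical initial environment state and task checks (illustrative only).
initial = {"alarm": {"alarms": []}, "settings": {"wifi": True}}
checks = [
    ("alarm.alarms.0.time", "07:30"),
    ("alarm.alarms.0.enabled", True),
    ("settings.wifi", True),
]

# Fork the state, then simulate an agent that creates the alarm but leaves it off.
partial = copy.deepcopy(initial)
partial["alarm"]["alarms"].append({"time": "07:30", "enabled": False})

# Fork again and finish the task by enabling the alarm.
done = copy.deepcopy(partial)
done["alarm"]["alarms"][0]["enabled"] = True

for name, state in [("initial", initial), ("partial", partial), ("done", done)]:
    print(name, judge(state, checks))
```

Because the judge reads only the structured state, the same inputs always produce the same verdict, and no free-text matching is involved. The partial-credit reward shown here is one simple choice made for illustration; the abstract does not specify how the dense reward is computed.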