MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use that targets interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals, through deterministic state-based judging over structured JSON state, and scalable online RL, through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances at about 400 MB of memory per instance and about 3 s of cold-start time. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates (256 test and 160 train) over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
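To make the idea of deterministic state-based judging concrete, the following is a minimal sketch, not MobileGym's actual implementation. It assumes the environment state is a nested JSON object and that a task declares its expected end state as a map from dotted key paths to values. The names `fork_state`, `flatten`, `diff_states`, and `judge`, the state layout, and the reward definition (fraction of expected values reached) are all hypothetical and chosen only for illustration.

```python
import json


def fork_state(state):
    """Fork a state by round-tripping it through JSON (assumes it is JSON-serializable)."""
    return json.loads(json.dumps(state))


def flatten(state, prefix=""):
    """Map a nested dict to {dotted.path: leaf_value}; lists count as leaf values."""
    flat = {}
    for key, value in state.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_states(before, after):
    """Return the set of dotted paths whose leaf value differs between two states."""
    a, b = flatten(before), flatten(after)
    return {p for p in set(a) | set(b) if a.get(p) != b.get(p)}


def judge(before, after, expected):
    """Hypothetical judge: one pass over the state yields a verdict and a dense reward.

    passed       -> every expected path holds its expected value and nothing else changed
    reward       -> fraction of expected paths that hold their expected value
    side_effects -> changed paths the task did not ask for
    """
    final = flatten(after)
    hits = sum(1 for path, value in expected.items() if final.get(path) == value)
    side_effects = sorted(diff_states(before, after) - set(expected))
    passed = hits == len(expected) and not side_effects
    reward = hits / len(expected) if expected else 0.0
    return passed, reward, side_effects


# Illustrative initial state and task; the keys are invented for this example.
initial = {"settings": {"wifi": True, "brightness": 40}, "notes": {"count": 2}}
expected = {"settings.wifi": False, "settings.brightness": 80}

# Rollout A completes only half of the task.
partial = fork_state(initial)
partial["settings"]["wifi"] = False
partial_result = judge(initial, partial, expected)

# Rollout B completes the task with no unrelated changes.
done = fork_state(partial)
done["settings"]["brightness"] = 80
done_result = judge(initial, done, expected)

# Rollout C completes the task but also modifies unrelated state.
messy = fork_state(done)
messy["notes"]["count"] = 3
messy_result = judge(initial, messy, expected)

print(partial_result, done_result, messy_result)
```

Because the verdict depends only on the two JSON snapshots, repeated evaluation of the same rollout gives the same result, and the same comparison can serve as a graded reward signal during training. Treating unrequested changes as failures is one possible design choice; a real judge might instead whitelist paths that are allowed to vary.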