Generative Visual Code Mobile World Models

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path to improving mobile GUI agent performance at both training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm, visual world modeling via renderable code generation, in which a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorldGen) that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models up to 50.25x larger. Further analyses show that (1) scaling training data via gWorldGen yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
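
The interface implied by this paradigm (current state and action in, renderable web code out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the gWorld implementation: `Action`, `predict_next_state_code`, and the toy HTML template are hypothetical stand-ins for a VLM call, and the rendering step (a headless browser in practice) is replaced here by a simple tag-nesting check built on Python's standard `html.parser`.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser


@dataclass
class Action:
    """Hypothetical GUI action; the field names are illustrative only."""
    kind: str       # e.g. "tap" or "type"
    target: str     # label of the UI element acted on
    text: str = ""  # text entered, used by "type" actions


def predict_next_state_code(current_title: str, action: Action) -> str:
    """Stand-in for the VLM: return the next GUI state as HTML.

    A real world model would condition on a screenshot and the action and
    generate this code itself; here a fixed template is filled in.
    """
    if action.kind == "type":
        body = f'<input value="{escape(action.text, quote=True)}">'
    else:
        body = f"<p>Opened {escape(action.target)}</p>"
    return (
        "<html><body>"
        f"<h1>{escape(current_title)}</h1>{body}"
        "</body></html>"
    )


class TagBalanceChecker(HTMLParser):
    """Tracks open tags to confirm the generated code is properly nested."""

    VOID = {"input", "br", "img", "meta", "hr", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        # Void elements never get a closing tag, so they are not tracked.
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_renderable(code: str) -> bool:
    """Crude proxy for 'this code would render': all tags are balanced."""
    checker = TagBalanceChecker()
    checker.feed(code)
    checker.close()
    return checker.balanced and not checker.stack


next_code = predict_next_state_code(
    "Search", Action(kind="type", target="search box", text="weather")
)
print(next_code)
print(is_renderable(next_code))
```

Because the predicted state is text rather than pixels, on-screen strings such as the typed query appear verbatim in the output, which is the property the abstract attributes to code-based generation.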
