Generative Visual Code Mobile World Models

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, in which a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
