Generative Visual Code Mobile World Models
February 2, 2026
Authors: Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin
cs.AI
Abstract
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path to improving mobile GUI agent performance at both training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In an extensive evaluation across four in-distribution and two out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models that are up to 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
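To make the "predict code, then render" idea concrete, here is a minimal Python sketch of a single world-model step. It is an illustration under stated assumptions, not the authors' implementation: `GuiState`, `world_model_step`, `stub_predictor`, and the action string format are all hypothetical names, the VLM is replaced by a hard-coded stub, the current state is passed as HTML rather than as a screenshot, and rendering to pixels (which would need a browser engine) is left out. What it does show is why a code-based next state keeps text exact: the predicted strings can be read back from the markup verbatim.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable


@dataclass
class GuiState:
    """A GUI state held as renderable web code (hypothetical container)."""
    html: str


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes from the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def extract_text(state: GuiState) -> list[str]:
    """Return text nodes in document order (includes any script/style text)."""
    collector = _TextCollector()
    collector.feed(state.html)
    collector.close()
    return collector.texts


# Assumed interface: (current state, action description) -> next-state HTML.
Predictor = Callable[[GuiState, str], str]


def world_model_step(predict: Predictor, state: GuiState, action: str) -> GuiState:
    """Predict the next GUI state as code instead of pixels."""
    return GuiState(html=predict(state, action))


def stub_predictor(state: GuiState, action: str) -> str:
    """Stand-in for a VLM: hard-codes one transition for illustration only."""
    if action == "tap:Settings":
        return (
            "<html><body><h1>Settings</h1>"
            "<ul><li>Wi-Fi</li><li>Display</li></ul></body></html>"
        )
    return state.html  # unknown action: assume the screen does not change


home = GuiState("<html><body><h1>Home</h1><button>Settings</button></body></html>")
next_state = world_model_step(stub_predictor, home, "tap:Settings")
print(extract_text(next_state))  # ['Settings', 'Wi-Fi', 'Display']
```

In a fuller setup, the stub would be swapped for a call to a VLM and the returned HTML would be passed to a headless browser to obtain the predicted screenshot; both steps are outside the scope of this sketch.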