Generative Visual Code Mobile World Models

February 2, 2026
Authors: Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin
cs.AI

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at both training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In an extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
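
The paradigm described in the abstract (predict the next GUI state as web code, then render that code to pixels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Action`, `CodeModel`, `build_prompt`, `predict_next_state`, and `stub_model` are hypothetical names introduced here, the prompt wording is invented, and the stub stands in for a real VLM. Rendering the returned HTML to an image would be handled separately, for example by a headless browser, and is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    """A GUI action taken on the current screen (hypothetical schema)."""
    kind: str        # e.g. "tap" or "type"
    target: str      # human-readable description of the UI element
    text: str = ""   # text payload, used by "type" actions


# Assumed model interface: (screenshot bytes, prompt) -> generated web code.
CodeModel = Callable[[bytes, str], str]


def build_prompt(action: Action) -> str:
    """Describe the action and ask for the next screen as web code."""
    detail = f' with text "{action.text}"' if action.text else ""
    return (
        f"The user performs '{action.kind}' on '{action.target}'{detail}. "
        "Output the next screen as a single self-contained HTML document."
    )


def predict_next_state(screenshot: bytes, action: Action, model: CodeModel) -> str:
    """Return the predicted next GUI state as HTML rather than as pixels."""
    html = model(screenshot, build_prompt(action))
    # Cheap sanity check so obviously non-renderable output fails early.
    if "<html" not in html.lower():
        raise ValueError("model output does not look like an HTML document")
    return html


def stub_model(screenshot: bytes, prompt: str) -> str:
    """Stand-in for a VLM; always returns the same fixed page."""
    return "<!DOCTYPE html><html><body><input value='hello'></body></html>"


if __name__ == "__main__":
    action = Action(kind="type", target="search box", text="hello")
    print(predict_next_state(b"", action, stub_model))
```

Because the model emits markup instead of pixels, any text in the predicted screen is produced as ordinary characters and drawn by the renderer, which is the property the abstract credits for precise text rendering.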