Code2World: A GUI World Model via Renderable Code Generation

Abstract

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model gives agents human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to achieve high visual fidelity and fine-grained structural controllability at the same time. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address data scarcity, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for following the layout format, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal and enforces visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling competitive models such as GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly improves downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
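
The abstract does not spell out how the render-aware reward is computed. Purely as an illustration, the sketch below assumes that each rendered prediction receives two scores in [0, 1], one for visual semantic fidelity and one for action consistency, and that code which fails to render earns no reward. The names `RenderScores`, `render_aware_reward` and `fidelity_weight`, and the weighted-sum form itself, are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class RenderScores:
    """Hypothetical judge scores for one rendered prediction, each in [0, 1]."""
    visual_fidelity: float     # how closely the render matches the true next screen
    action_consistency: float  # whether the change agrees with the executed action


def render_aware_reward(scores, rendered_ok, fidelity_weight=0.5):
    """Combine the two scores into one scalar (illustrative weighting only)."""
    if not rendered_ok:
        # Assumption: code that fails to render gives no image to judge,
        # so it earns zero reward.
        return 0.0
    for value in (scores.visual_fidelity, scores.action_consistency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return (fidelity_weight * scores.visual_fidelity
            + (1.0 - fidelity_weight) * scores.action_consistency)
```

In a training loop of this shape, the generated HTML would first be rendered to an image, a judge would produce the two scores, and the resulting scalar would be passed to the policy-gradient update.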
