World Craft: Agentic Framework to Create Visualizable Worlds via Text
January 14, 2026
Authors: Jianwen Sun, Yukang Feng, Kaining Ying, Chuanhao Li, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Yifan Chang, Yu Dai, Yifei Huang, Kaipeng Zhang
cs.AI
Abstract
Large Language Models (LLMs) have motivated generative agent simulations (e.g., AI Town) that create a "dynamic world", holding immense value across entertainment and research. However, non-experts, especially those without programming skills, find it difficult to customize a visualizable environment by themselves. In this paper, we introduce World Craft, an agentic world creation framework that creates an executable and visualizable AI Town from users' textual descriptions. It consists of two main modules, World Scaffold and World Guild. World Scaffold is a structured and concise standard for developing interactive game scenes, serving as an efficient scaffold for LLMs to customize an executable AI Town-like environment. World Guild is a multi-agent framework that progressively analyzes users' intents from rough descriptions and synthesizes the structured content (e.g., environment layout and assets) required by World Scaffold. Moreover, we construct a high-quality error-correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, and we report multi-dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini-3-Pro) in scene construction and narrative intent conveyance, providing a scalable solution for the democratization of environment creation.
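The abstract does not specify the World Scaffold format or how layout errors are detected, so the following is only a minimal sketch of what a structured scene description with a layout check could look like. Every name here (`Asset`, `Scene`, `overlaps`, `layout_errors`, the tile-grid fields) is a hypothetical illustration and not the authors' schema or method. It shows the general idea that a structured layout can be checked mechanically for spatial errors, such as out-of-bounds or overlapping objects, before it is rendered.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Asset:
    """A rectangular object placed on a tile grid (hypothetical schema)."""
    name: str
    x: int       # left tile column
    y: int       # top tile row
    width: int   # size in tiles
    height: int


@dataclass
class Scene:
    """A grid-based scene description (hypothetical schema)."""
    name: str
    cols: int
    rows: int
    assets: List[Asset]


def overlaps(a: Asset, b: Asset) -> bool:
    """Return True if the two rectangles share at least one tile."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def layout_errors(scene: Scene) -> List[str]:
    """Collect out-of-bounds and overlap problems as readable messages."""
    errors = []
    for asset in scene.assets:
        inside = (asset.x >= 0 and asset.y >= 0
                  and asset.x + asset.width <= scene.cols
                  and asset.y + asset.height <= scene.rows)
        if not inside:
            errors.append(f"{asset.name} is out of bounds")
    for i, a in enumerate(scene.assets):
        for b in scene.assets[i + 1:]:
            if overlaps(a, b):
                errors.append(f"{a.name} overlaps {b.name}")
    return errors


# Illustrative scene with two deliberate mistakes.
cafe = Scene(
    name="cafe",
    cols=10,
    rows=8,
    assets=[
        Asset("counter", x=0, y=0, width=4, height=1),
        Asset("table", x=3, y=0, width=2, height=2),   # collides with counter
        Asset("plant", x=9, y=7, width=2, height=1),   # sticks out on the right
    ],
)
print(layout_errors(cafe))
# → ['plant is out of bounds', 'counter overlaps table']
```

In a sketch like this, the error messages could be fed back to a layout generator as correction hints; whether World Craft does anything similar is not stated in the abstract.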