RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
November 22, 2025
Authors: Shengyuan Wang, Zhiheng Zheng, Yu Shang, Lixuan He, Yangcheng Yu, Fan Hangyu, Jie Feng, Qingmin Liao, Yong Li
cs.AI
Abstract
City-scale 3D generation is of great importance for the development of embodied intelligence and world models. However, existing methods face significant challenges in the quality, fidelity, and scalability of the 3D worlds they generate. To address these challenges, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, which features dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetic quality, with a win rate of over 90% against existing baselines in overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.
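The abstract does not specify how the iterative self-reflection and refinement step is implemented. The following is a minimal sketch of a generic generate-critique-refine loop, assuming a generator tool that revises the previous candidate using textual feedback and a critic tool that returns a score in [0, 1] plus feedback. The names `reflect_and_refine`, `threshold`, `max_rounds`, `toy_generate`, and `toy_critique` are hypothetical illustrations, not RAISECity's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical tool signatures (assumptions, not taken from the paper):
#   generate(task, previous_candidate, feedback) -> new candidate
#   critique(task, candidate) -> (score in [0, 1], textual feedback)
Generator = Callable[[str, Optional[str], str], str]
Critic = Callable[[str, str], Tuple[float, str]]


@dataclass
class RefinementTrace:
    # Every (candidate, score) pair produced, in round order.
    history: List[Tuple[str, float]] = field(default_factory=list)


def reflect_and_refine(
    task: str,
    generate: Generator,
    critique: Critic,
    threshold: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[str, RefinementTrace]:
    """Generic self-refinement loop: generate, critique, revise until good enough."""
    trace = RefinementTrace()
    previous: Optional[str] = None
    feedback = ""
    best_candidate, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidate = generate(task, previous, feedback)
        score, feedback = critique(task, candidate)
        trace.history.append((candidate, score))
        # Keep the best-scoring candidate so a worse late round cannot overwrite it.
        if score > best_score:
            best_candidate, best_score = candidate, score
        if score >= threshold:
            break
        previous = candidate
    return best_candidate, trace


# Toy stand-ins for real multimodal tools (purely illustrative).
def toy_generate(task: str, previous: Optional[str], feedback: str) -> str:
    # First round returns the bare task; later rounds append the requested fix.
    return task if previous is None else f"{previous} +{feedback}"


def toy_critique(task: str, candidate: str) -> Tuple[float, str]:
    # Score grows with the number of added details; three details are "enough".
    n = candidate.count("+detail")
    return min(1.0, n / 3), "detail"


if __name__ == "__main__":
    best, trace = reflect_and_refine("roof", toy_generate, toy_critique)
    print(best, len(trace.history))
```

With the toy tools, the loop stops after four rounds, once the critic's score reaches the threshold. Tracking the best candidate, rather than simply returning the last one, is one plausible way such a loop can limit error accumulation; the actual acceptance criteria used by RAISECity are not described in the abstract.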