RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City Scale

Abstract

City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges in the quality, fidelity, and scalability of 3D world generation. To address this, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, city-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, reduces cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetic quality, with a win rate of over 90% against existing baselines on overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.