
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

April 21, 2026
Authors: Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
cs.AI

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
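The abstract describes the approach at a high level: retrieve geo-registered imagery near the requested camera path, use it as context to pin down scene geometry, and let a learned prior plus a prompt control transient attributes such as weather and lighting. The minimal sketch below illustrates that retrieval-augmented loop. Every name in it (GeoFrame, retrieve_context, generate_tour, model.sample, chunk_len) is a hypothetical stand-in for illustration, not CityRAG's actual interface, and the generator itself is assumed rather than implemented.

# Hypothetical sketch of retrieval-augmented, geo-grounded video generation.
# All interfaces are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass
import math

@dataclass
class GeoFrame:
    lat: float          # latitude of the geo-registered capture
    lon: float          # longitude of the geo-registered capture
    image_path: str     # pointer to the stored street-level image

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_context(corpus, waypoint, k=4):
    """Pick the k geo-registered frames closest to a trajectory waypoint.

    These frames ground generation to the physical scene. Because the paper
    trains on temporally unaligned data, the model is expected to take
    geometry from this context while drawing weather and lighting from the
    prompt rather than from the retrieved captures.
    """
    lat, lon = waypoint
    return sorted(corpus, key=lambda f: haversine_m(f.lat, f.lon, lat, lon))[:k]

def generate_tour(model, corpus, trajectory, prompt, chunk_len=49):
    """Generate a long video chunk by chunk along a camera trajectory.

    `model.sample` is a stand-in for a context-conditioned video generator.
    Feeding the last generated frame into the next chunk keeps the sequence
    temporally coherent; fresh retrieved context at each waypoint keeps it
    spatially grounded, including when the path revisits a location.
    """
    video, last_frame = [], None
    for waypoint in trajectory:
        context = retrieve_context(corpus, waypoint)
        chunk = model.sample(
            context_frames=context,   # geo-registered grounding
            prompt=prompt,            # transient attributes: weather, lighting
            init_frame=last_frame,    # continuity across chunks
            num_frames=chunk_len,
        )
        video.extend(chunk)
        last_frame = chunk[-1]
    return video

Under these assumptions, the abstract's claims map onto the loop naturally: holding the prompt fixed across chunks would keep weather and lighting consistent over thousands of frames, while re-retrieving context from the same geo-registered corpus when a trajectory returns to a visited location is what would make loop closure possible.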