

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

April 21, 2026
作者: Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
cs.AI

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
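The abstract describes using geo-registered data as retrieval context for generation. As a purely illustrative sketch (the paper's actual retrieval mechanism, scoring function, and data format are not specified here; `GeoFrame`, `retrieve_context`, and the distance-plus-heading score are all hypothetical), one simple way to select context frames for a query camera pose is to rank a geo-registered database by spatial distance and viewing-direction mismatch:

```python
import math
from dataclasses import dataclass

@dataclass
class GeoFrame:
    """A hypothetical geo-registered image record: position plus camera heading."""
    lat: float
    lon: float
    heading_deg: float
    image_id: str

def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def retrieve_context(database, query_lat, query_lon, query_heading, k=4):
    """Return the k frames best matching the query pose (toy score:
    equirectangular distance in meters plus a heading-mismatch penalty)."""
    def score(f: GeoFrame) -> float:
        # Equirectangular approximation is adequate at city scale.
        dlat = (f.lat - query_lat) * 111_320.0
        dlon = (f.lon - query_lon) * 111_320.0 * math.cos(math.radians(query_lat))
        dist_m = math.hypot(dlat, dlon)
        return dist_m + 0.5 * angular_diff(f.heading_deg, query_heading)
    return sorted(database, key=score)[:k]
```

The retrieved frames would then be passed to the generator as conditioning context; how CityRAG encodes and attends to that context is beyond what the abstract states.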