CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent, minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

