CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications, including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that uses large corpora of geo-registered data as context to ground generation in the physical scene while retaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent, minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
