Grounding World Simulation Models in a Real-World Metropolis

Abstract

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models imagine all of their content, synthesizing environments that are visually plausible yet artificial. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. This design introduces several challenges: temporal misalignment between the retrieved references and the dynamic target scene, and limited trajectory diversity and data sparsity caused by vehicle-mounted capture at sparse intervals. We address these challenges with cross-temporal pairing, a large-scale synthetic dataset that enables diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink, which stabilizes long-horizon generation by continuously re-grounding each chunk to an image retrieved at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods at generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
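The abstract does not give implementation details, so the following is only an illustrative sketch of what retrieval-augmented conditioning with a lookahead reference could look like at the data level. Everything in it is an assumption made for illustration: the `StreetView` record, the `retrieve_nearby` and `plan_chunks` helpers, the use of great-circle distance as the retrieval metric, and the `chunk_len`, `lookahead`, and `k` parameters are hypothetical and are not taken from SWM. The sketch only selects which reference images each generated chunk would be conditioned on; it does not model the video generator itself.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass(frozen=True)
class StreetView:
    """Hypothetical record for one geotagged street-view image."""
    image_id: str
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_nearby(db, lat, lon, k=2):
    """Return the k street-view records closest to (lat, lon).

    Assumption: nearest-by-distance retrieval; a brute-force sort is
    used here for clarity, not efficiency.
    """
    return sorted(db, key=lambda sv: haversine_m(lat, lon, sv.lat, sv.lon))[:k]


def plan_chunks(db, waypoints, chunk_len=2, lookahead=1, k=2):
    """Pick conditioning references for each chunk of a camera trajectory.

    For every chunk, retrieve k references near the chunk's first waypoint
    and one extra reference at a waypoint `lookahead` steps past the end of
    the chunk (clamped to the end of the trajectory).
    """
    plans = []
    for start in range(0, len(waypoints), chunk_len):
        lat, lon = waypoints[start]
        ahead = min(start + chunk_len - 1 + lookahead, len(waypoints) - 1)
        refs = retrieve_nearby(db, lat, lon, k)
        future_ref = retrieve_nearby(db, *waypoints[ahead], k=1)[0]
        plans.append({
            "start": start,
            "refs": [r.image_id for r in refs],
            "lookahead": future_ref.image_id,
        })
    return plans
```

Calling `plan_chunks` with a list of `StreetView` records and a list of `(lat, lon)` waypoints yields one entry per chunk, each naming the nearby references and the single future-location reference. Under this assumed design, a generator would receive a fresh future-location image for every chunk instead of relying only on its own previously generated frames.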
