ChatPaper.ai


Grounding World Simulation Models in a Real-World Metropolis

March 16, 2026
Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim
cs.AI

Abstract

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
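To make the retrieval-grounded autoregressive loop concrete, here is a minimal sketch of the idea described in the abstract: each generated chunk is conditioned on the previous frame plus a street-view image retrieved at a position *ahead* of the current one (the "virtual lookahead sink"). All names and the 1-D position model are illustrative stand-ins, not the authors' implementation; the generator is a placeholder.

```python
# Hypothetical sketch of an SWM-style retrieval-grounded rollout.
# `db` maps a 1-D trajectory position to a street-view image (here, a string).

def retrieve_street_view(db, position):
    """Retrieve the street-view image captured nearest to `position`."""
    return db[min(db, key=lambda p: abs(p - position))]

def generate_chunk(prev_frame, anchor_image, n_frames=4):
    """Placeholder for the video generator: each frame is conditioned on the
    previous frame and the retrieved future anchor image."""
    return [f"{prev_frame}->{anchor_image}#{i}" for i in range(n_frames)]

def rollout(db, trajectory, lookahead=10.0, chunk_len=4):
    """Autoregressive generation: every chunk is re-grounded on an image
    retrieved `lookahead` meters ahead, stabilizing long-horizon rollouts."""
    frames = [retrieve_street_view(db, trajectory[0])]  # grounded start frame
    for pos in trajectory[1:]:
        anchor = retrieve_street_view(db, pos + lookahead)  # future anchor
        frames += generate_chunk(frames[-1], anchor, chunk_len)
    return frames

db = {0.0: "imgA", 25.0: "imgB", 50.0: "imgC"}
video = rollout(db, trajectory=[0.0, 20.0, 40.0], lookahead=10.0)
# 1 grounded start frame + 2 chunks of 4 frames, each chunk anchored ahead
```

The key design point the sketch mirrors is that grounding is refreshed per chunk rather than only at the start, so drift cannot accumulate unboundedly over trajectories of hundreds of meters.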