INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
April 8, 2026
Authors: InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao
cs.AI
Abstract
Building world models that are both spatially consistent and interactive in real time remains a fundamental challenge in computer vision. Current video generation paradigms often lack spatial persistence and visual realism, which makes seamless navigation in complex environments difficult to support. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components. The Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation. The Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizing guide to overcome the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision and ranks first among real-time interactive methods on the WorldScore-Dynamic benchmark, establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.