INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
April 8, 2026
Authors: InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao
cs.AI
Abstract
Building world models that are both spatially consistent and interactive in real time remains a fundamental challenge in computer vision. Current video generation paradigms often lack spatial persistence and visual realism, which makes it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: an Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation to ensure global consistency during long-horizon navigation, and an Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise, physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizing guide to overcome the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranks first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishes a practical pipeline for navigating 4D environments reconstructed from monocular videos.