

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

December 2, 2025
Authors: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan
cs.AI

Abstract

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate the framework on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and path planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or camera poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
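
As a rough illustration of the core idea described in the abstract (and not the authors' implementation), the sketch below shows one way a video diffusion denoiser could be conditioned purely on video-based scene context: clean context frames are concatenated with noisy target frames along the time axis, and camera-pose instructions enter as per-frame tokens. All module names, tensor shapes, and the pose-injection scheme are hypothetical.

```python
# Minimal sketch, assuming a latent video diffusion backbone; everything
# below (names, shapes, pose injection) is a hypothetical illustration.
import torch
import torch.nn as nn

class ContextConditionedDenoiser(nn.Module):
    def __init__(self, channels: int = 8, hidden: int = 64):
        super().__init__()
        # Stand-in for a video diffusion backbone (e.g., a 3D UNet or DiT).
        self.backbone = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, noisy_target, context, pose_tokens):
        # context:      clean latents of the scene video, (B, C, Tc, H, W)
        # noisy_target: noised latents to denoise,        (B, C, Tt, H, W)
        # pose_tokens:  camera-pose instructions,         (B, Tt, C)
        # Inject pose instructions as a per-frame bias on the target frames.
        pose_bias = pose_tokens.permute(0, 2, 1)[..., None, None]  # (B, C, Tt, 1, 1)
        noisy_target = noisy_target + pose_bias
        # Condition purely on video: concatenate context and target along time;
        # no depth maps or explicit scene poses are used.
        x = torch.cat([context, noisy_target], dim=2)
        out = self.backbone(x)
        # Supervise only the predictions for the target frames.
        return out[:, :, context.shape[2]:]

# Usage: 16 context frames guide the denoising of 8 target frames.
model = ContextConditionedDenoiser()
ctx = torch.randn(1, 8, 16, 32, 32)
tgt = torch.randn(1, 8, 8, 32, 32)
pose = torch.randn(1, 8, 8)
print(model(tgt, ctx, pose).shape)  # torch.Size([1, 8, 8, 32, 32])
```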