

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

December 2, 2025
作者: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan
cs.AI

Abstract

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation (following camera-pose instructions while remaining consistent with the scene's 3D geometry) and object grounding (which requires semantic localization, instruction following, and planning). Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
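The abstract does not spell out the conditioning mechanism, so the following is only a minimal PyTorch sketch of what "a video diffusion denoiser conditioned solely on video-based scene context and camera-pose instructions" could look like. Every name, tensor shape, and the raw 6-DoF pose encoding here (ContextGuidedDenoiser, frame_dim, pose_dim, and so on) is a hypothetical illustration, not the paper's actual architecture.

```python
# Hypothetical sketch: a diffusion denoiser whose only conditioning signals are
# (a) latent tokens of context video frames and (b) a camera-pose instruction,
# with no depth or pose maps of the scene itself. Shapes and modules are
# assumptions for illustration.
import torch
import torch.nn as nn

class ContextGuidedDenoiser(nn.Module):
    def __init__(self, frame_dim=256, pose_dim=6, d_model=256, n_heads=4):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)  # per-frame latent tokens
        self.pose_proj = nn.Linear(pose_dim, d_model)    # camera-pose instruction tokens
        self.time_proj = nn.Linear(1, d_model)           # diffusion-timestep embedding
        # Cross-attention: noisy target frames attend to context frames + pose tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, frame_dim)         # predict per-token noise

    def forward(self, noisy_frames, context_frames, pose_instr, t):
        # noisy_frames:   (B, T_out, frame_dim) latents of the frames being denoised
        # context_frames: (B, T_ctx, frame_dim) video-only scene context
        # pose_instr:     (B, T_out, pose_dim)  per-step camera-pose instruction
        # t:              (B, 1)                diffusion timestep
        q = self.frame_proj(noisy_frames) + self.time_proj(t).unsqueeze(1)
        kv = torch.cat([self.frame_proj(context_frames),
                        self.pose_proj(pose_instr)], dim=1)
        h, _ = self.cross_attn(q, kv, kv)
        return self.out(h)  # trained with a standard diffusion noise-prediction loss

# Toy usage: denoise 8 target frames given 16 context frames and a pose path.
model = ContextGuidedDenoiser()
noise_pred = model(torch.randn(2, 8, 256),   # noisy target-frame latents
                   torch.randn(2, 16, 256),  # scene-context frame latents
                   torch.randn(2, 8, 6),     # camera-pose instructions
                   torch.rand(2, 1))         # timesteps
print(noise_pred.shape)  # torch.Size([2, 8, 256])
```

Cross-attention is used here simply because it is a common way to inject variable-length context into a denoiser; whether Video4Spatial conditions this way, via token concatenation, or by another mechanism is not stated in the abstract.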