
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

December 23, 2025
作者: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
cs.AI

Abstract

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships in 3D space evolve over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap at the dataset, benchmark, and model levels, we introduce the DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs for DSR from in-the-wild videos. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous work, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) that seamlessly integrates geometric priors into VLMs: it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
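To make the question-generation idea concrete, here is a minimal sketch of one possible DSR question template. It assumes per-object 3D trajectories (as the abstract says the pipeline extracts) and builds a multiple-choice item asking which object travels the farthest; the function name `make_mcq` and the single template are illustrative inventions, not the paper's actual pipeline, which covers far richer templates (viewpoint transformations, orientations, multi-object interactions).

```python
import random
import numpy as np

def make_mcq(trajectories, names, seed=0):
    """Hypothetical sketch of one DSR question template: given 3D object
    trajectories (each an array of shape (T, 3)), ask which object travels
    the longest 3D path and assemble a shuffled multiple-choice option set."""
    rng = random.Random(seed)
    # Path length = sum of per-step displacement magnitudes along the trajectory.
    lengths = {name: float(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum())
               for name, traj in zip(names, trajectories)}
    answer = max(lengths, key=lengths.get)
    options = list(names)
    rng.shuffle(options)
    return {
        "question": "Which object travels the longest 3D path over the clip?",
        "options": options,
        "answer": answer,
    }
```

Because the answer is computed from recovered geometry rather than annotated by hand, templates like this can be instantiated automatically at scale over in-the-wild videos.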
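The Geometry Selection Module is described only at a high level in the abstract (condense question semantics, then extract question-relevant knowledge from pretrained 4D reconstruction priors into compact geometry tokens). A plausible reading of that description is a small cross-attention head, sketched below; the class name, layer layout, and dimensions are assumptions for illustration, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class GeometrySelectionModule(nn.Module):
    """Hypothetical sketch of a GSM-style module: learnable query tokens first
    attend over question embeddings (condensing question semantics), then
    attend over frozen 4D-reconstruction features (selecting question-relevant
    geometry), yielding a compact set of geometry tokens for the VLM."""

    def __init__(self, d_model=1024, n_geo_tokens=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_geo_tokens, d_model) * 0.02)
        self.condense = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.select = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, question_emb, geo_feats):
        # question_emb: (B, Lq, D) token embeddings of the question
        # geo_feats:    (B, Lg, D) features from a pretrained 4D reconstructor
        batch = question_emb.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (B, K, D)
        q, _ = self.condense(q, question_emb, question_emb)   # condense semantics
        geo_tokens, _ = self.select(q, geo_feats, geo_feats)  # pick relevant geometry
        return self.proj(geo_tokens)                          # (B, K, D) geometry tokens
```

Keeping K small (here 16 tokens) matches the abstract's motivation: the VLM sees only a compact, question-conditioned summary of the 4D prior instead of the full reconstruction features.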