

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

December 23, 2025
作者: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
cs.AI

Abstract

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships in 3D space evolve over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap at the dataset, benchmark, and model levels, we introduce the DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs for DSR from in-the-wild videos. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous work, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) that seamlessly integrates geometric priors into VLMs: it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
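To make the pipeline idea concrete, here is a minimal sketch of how a multiple-choice question could be programmatically derived from extracted geometric cues. The abstract does not specify the paper's question templates or distractor strategy; the `azimuth`/`make_mcq` functions, the distractor offsets, and the camera-frame trajectory format below are all hypothetical illustrations of turning a 3D trajectory into a fine-grained, procedurally answered question.

```python
import math
import random

def azimuth(p):
    """Azimuth (degrees) of a 3D point in the camera frame, measured in the x-z plane."""
    return math.degrees(math.atan2(p[0], p[2]))

def make_mcq(traj_cam, obj_name, t0, t1, seed=0):
    """Build one multiple-choice QA pair about how an object's azimuth
    relative to the camera changes between frames t0 and t1.
    traj_cam: list of (x, y, z) object positions in the camera frame."""
    rng = random.Random(seed)
    delta = round(azimuth(traj_cam[t1]) - azimuth(traj_cam[t0]), 1)
    # Distractors: plausible but wrong offsets around the true value (hypothetical choice).
    options = [delta] + [round(delta + d, 1) for d in rng.sample([-60, -30, 30, 60], 3)]
    rng.shuffle(options)
    question = (f"By how many degrees does the azimuth of the {obj_name} "
                f"(relative to the camera) change from frame {t0} to frame {t1}?")
    return question, options, options.index(delta)

# Example: an object moving from straight ahead to 45 degrees right of the camera.
traj = [(0.0, 0.0, 2.0), (2.0, 0.0, 2.0)]
q, opts, ans = make_mcq(traj, "chair", 0, 1)
```

Because the answer is computed from the geometry rather than annotated by hand, such templates scale to large video corpora, matching the "automated pipeline" and "fine-grained, procedural answers" properties claimed above.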
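The Geometry Selection Module is described as condensing question semantics and distilling question-relevant 4D knowledge into a compact set of geometry tokens. One natural reading is a small cross-attention block with learned queries conditioned on the question. The sketch below, with numpy standing in for a deep-learning framework, is only that reading: the class name, shapes, conditioning scheme, and single-head attention are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64       # shared embedding width (hypothetical)
N_GEO = 256  # feature tokens from a frozen 4D reconstruction prior
N_OUT = 8    # compact geometry tokens handed to the VLM

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class GeometrySelectionModule:
    """Sketch of a GSM: learned queries, conditioned on the question
    embedding, cross-attend over 4D-prior features and return a small
    set of geometry tokens. All names and shapes are illustrative."""
    def __init__(self, d=D, n_out=N_OUT):
        self.queries = rng.standard_normal((n_out, d)) * 0.02  # learned query tokens
        self.w_q = rng.standard_normal((d, d)) * 0.02
        self.w_k = rng.standard_normal((d, d)) * 0.02
        self.w_v = rng.standard_normal((d, d)) * 0.02

    def __call__(self, question_emb, geo_feats):
        # Condense question semantics: condition queries on a pooled question embedding.
        q = (self.queries + question_emb.mean(axis=0)) @ self.w_q   # (n_out, d)
        k = geo_feats @ self.w_k                                    # (N_GEO, d)
        v = geo_feats @ self.w_v
        attn = softmax(q @ k.T / math_sqrt := np.sqrt(q.shape[-1])) if False else softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_out, N_GEO)
        return attn @ v                                             # (n_out, d) geometry tokens

gsm = GeometrySelectionModule()
tokens = gsm(rng.standard_normal((12, D)), rng.standard_normal((N_GEO, D)))
```

The design point the abstract emphasizes is the bottleneck: only `N_OUT` tokens (here 8) reach the VLM, so irrelevant reconstruction knowledge is filtered out before it can overwhelm the language backbone.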