MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

November 23, 2025
作者: Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha
cs.AI

Abstract

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with how VLMs perceive, comprehend, and reason. We introduce MASS-Bench, a comprehensive benchmark of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines as well as prior state-of-the-art models by 8.7% and 6.0%, respectively, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.
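To make the injection idea concrete, below is a minimal, hypothetical sketch of the kind of pipeline the abstract describes: lifting per-frame 2D detections to camera-frame 3D using depth, estimating object velocity by finite differences, and rendering the result as textual spatial-temporal cues prepended to a VLM prompt. All names here (Detection, backproject, to_motion_cues) and the pinhole intrinsics are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the MASS codebase): turn depth-based 3D object
# tracks into textual motion cues that can be injected into a VLM prompt.

from dataclasses import dataclass

@dataclass
class Detection:
    frame: int       # frame index
    u: float         # pixel x of the object center
    v: float         # pixel y of the object center
    depth_m: float   # metric depth at (u, v), e.g. from a monocular depth model

def backproject(det: Detection, fx: float, fy: float, cx: float, cy: float):
    """Lift a 2D detection to camera-frame 3D with a pinhole camera model."""
    z = det.depth_m
    x = (det.u - cx) * z / fx
    y = (det.v - cy) * z / fy
    return (x, y, z)

def to_motion_cues(label: str, dets: list[Detection], fps: float,
                   fx=600.0, fy=600.0, cx=320.0, cy=240.0) -> str:
    """Render a 3D track as a short textual cue for the VLM language space."""
    pts = [backproject(d, fx, fy, cx, cy) for d in dets]
    lines = [f"object '{label}' 3D track (camera frame, meters):"]
    for d, p in zip(dets, pts):
        lines.append(f"  t={d.frame / fps:.2f}s pos=({p[0]:.2f}, {p[1]:.2f}, {p[2]:.2f})")
    # Finite-difference velocity between the first and last observations.
    dt = (dets[-1].frame - dets[0].frame) / fps
    if dt > 0:
        vel = [(b - a) / dt for a, b in zip(pts[0], pts[-1])]
        lines.append(f"  mean velocity=({vel[0]:.2f}, {vel[1]:.2f}, {vel[2]:.2f}) m/s")
    return "\n".join(lines)

if __name__ == "__main__":
    track = [Detection(0, 300, 200, 4.0), Detection(30, 340, 210, 3.0)]
    cues = to_motion_cues("ball", track, fps=30.0)
    prompt = cues + "\n\nQuestion: Is the ball moving toward the camera?"
    print(prompt)
```

In this toy example the object's depth drops from 4.0 m to 3.0 m over one second, so the serialized cue makes the "moving toward the camera" answer directly readable in the language space, which is the kind of cross-modal alignment the reinforcement fine-tuning stage would then strengthen.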