

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

August 4, 2025
作者: Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi
cs.AI

Abstract

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
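
Since the abstract describes scoring VLMs on curated video question-answer pairs against human baselines, the sketch below shows one plausible way such a multiple-choice evaluation loop could be driven. It is an assumption-laden illustration: the record fields (`video`, `question`, `choices`, `answer`), the file name `vlm4d_qa.json`, and the `predict` callable are all hypothetical, since the paper's actual data format and evaluation protocol are not given here.

```python
import json
from typing import Callable, List

# Hypothetical layout of one benchmark item (not the official VLM4D schema):
# {"video": "clips/0001.mp4", "question": "...", "choices": ["...", ...], "answer": 2}

def evaluate(items_path: str,
             predict: Callable[[str, str, List[str]], int]) -> float:
    """Compute multiple-choice accuracy of a predictor over video QA items."""
    with open(items_path) as f:
        items = json.load(f)
    correct = 0
    for item in items:
        # The predictor receives the video path, question text, and answer
        # choices, and returns the index of its chosen answer.
        pred = predict(item["video"], item["question"], item["choices"])
        correct += int(pred == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    # Placeholder predictor that always picks the first choice; a real run
    # would substitute a call to an open or closed-source VLM.
    naive = lambda video, question, choices: 0
    print(f"accuracy: {evaluate('vlm4d_qa.json', naive):.3f}")
```

Reporting accuracy per question category (translational motion, rotational motion, perspective awareness, motion continuity) rather than a single aggregate score would mirror the fine-grained comparison against human baselines that the abstract emphasizes.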