VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

August 4, 2025
Authors: Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi
cs.AI

Abstract

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open- and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
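To make the benchmark structure concrete, here is a minimal illustrative sketch of what a multiple-choice spatiotemporal QA item and its accuracy scoring might look like. The field names, category labels, and evaluation interface below are assumptions for illustration only, not the actual VLM4D dataset schema or protocol.

```python
# Hypothetical sketch of a VLM4D-style QA item and accuracy computation.
# The schema and category names are assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class SpatiotemporalQA:
    video_path: str    # real-world or synthetic clip
    question: str      # probes motion, rotation, or perspective
    choices: list[str] # multiple-choice options
    answer: str        # ground-truth choice
    category: str      # e.g. "translation", "rotation", "perspective", "continuity"

def accuracy(items: list[SpatiotemporalQA], predict) -> float:
    """Fraction of items where the predicted choice matches the answer key."""
    correct = sum(
        predict(item.video_path, item.question, item.choices) == item.answer
        for item in items
    )
    return correct / len(items) if items else 0.0

if __name__ == "__main__":
    items = [
        SpatiotemporalQA(
            video_path="clips/ball_rolls_left.mp4",  # hypothetical clip
            question="From the camera's viewpoint, which way does the ball move?",
            choices=["left", "right", "toward the camera"],
            answer="left",
            category="translation",
        )
    ]
    # Dummy baseline predictor that always picks the first choice;
    # a real evaluation would call a VLM here.
    baseline = lambda video, question, choices: choices[0]
    print(f"accuracy: {accuracy(items, baseline):.2f}")
```

Per-category accuracy (grouping items by the `category` field) would surface the motion- and perspective-specific gaps the paper reports against human baselines.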