VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Abstract

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts. These abilities are essential for robust understanding of the dynamic real world, yet they are notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open-source and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, and demonstrate their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
