ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Abstract

Spatio-temporal reasoning is essential for understanding real-world environments in fields such as autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
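As a rough illustration of the kinematic quantities named in the abstract, the sketch below computes traveled distance, average speed, and an inter-object distance comparison from a sequence of 3D object positions. It is a minimal sketch under stated assumptions, not the STKit labeling pipeline: the function names, the trajectory representation (one 3D center per frame, in meters), and the timestamps (in seconds) are all hypothetical choices made for this example.

```python
import math
from typing import Sequence, Tuple

# Hypothetical representation: one 3D object center per frame, in meters.
Point = Tuple[float, float, float]


def traveled_distance(track: Sequence[Point]) -> float:
    """Path length: sum of straight-line distances between consecutive positions."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))


def average_speed(track: Sequence[Point], timestamps: Sequence[float]) -> float:
    """Path length divided by elapsed time (meters per second)."""
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        raise ValueError("timestamps must span a positive duration")
    return traveled_distance(track) / duration


def closer_object(reference: Point, first: Point, second: Point) -> str:
    """Inter-object distance comparison: which of two objects is nearer to a reference."""
    return "first" if math.dist(reference, first) <= math.dist(reference, second) else "second"


# Toy trajectory (assumed values): 5 m along the ground plane, then 12 m upward.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
timestamps = [0.0, 1.0, 2.0]

print(traveled_distance(track))          # path length in meters
print(average_speed(track, timestamps))  # average speed in m/s
print(closer_object((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 5.0, 0.0)))
```

Note that path length, not net displacement, is used for traveled distance here, so an object that returns to its starting point still reports a nonzero distance.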
