ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
March 25, 2025
Authors: Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim
cs.AI
Abstract
Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
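To make the annotated kinematic quantities concrete, here is a minimal sketch of how traveled distance, average speed, and movement direction can be derived from a per-frame 3D trajectory. It assumes object centers are given in meters at a known frame rate; the function `kinematics_from_trajectory` and its input layout are illustrative assumptions, not the paper's actual annotation pipeline.

```python
import numpy as np

def kinematics_from_trajectory(centers: np.ndarray, fps: float):
    """Derive basic kinematic quantities from a 3D trajectory.

    centers: (T, 3) array of an object's 3D positions in meters,
             one row per sampled frame (assumed layout).
    fps:     sampling rate in frames per second.
    """
    deltas = np.diff(centers, axis=0)               # (T-1, 3) per-step displacements
    step_lengths = np.linalg.norm(deltas, axis=1)   # length of each step in meters
    traveled_distance = step_lengths.sum()          # path length, not net displacement
    duration = (len(centers) - 1) / fps             # elapsed time in seconds
    avg_speed = traveled_distance / duration        # meters per second
    net = centers[-1] - centers[0]                  # overall displacement vector
    direction = net / (np.linalg.norm(net) + 1e-8)  # unit movement direction
    return traveled_distance, avg_speed, direction

if __name__ == "__main__":
    # Toy trajectory: an object moving roughly along +x at ~2 m/s, sampled at 10 fps.
    t = np.arange(21)
    centers = np.stack([0.2 * t, 0.01 * t, np.zeros_like(t)], axis=1).astype(float)
    dist, speed, direction = kinematics_from_trajectory(centers, fps=10.0)
    print(f"traveled {dist:.2f} m at {speed:.2f} m/s, heading {direction.round(2)}")
```

The same quantities computed for two objects would support the dataset's comparative questions, e.g., which object traveled farther or whether two objects move in similar directions.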