ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
March 25, 2025
Authors: Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim
cs.AI
Abstract
Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
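To make the annotated kinematic quantities concrete, here is a minimal sketch of how traveled distance, average speed, and movement direction can be derived from a per-frame 3D trajectory. It assumes object centers are given in meters at a known frame rate; the function `kinematics_from_trajectory` and its input layout are illustrative assumptions, not the paper's actual annotation pipeline.

```python
import numpy as np

def kinematics_from_trajectory(centers: np.ndarray, fps: float):
    """Derive basic kinematic quantities from a 3D trajectory.

    centers: (T, 3) array of an object's 3D positions in meters,
             one row per sampled frame (assumed layout).
    fps:     sampling rate in frames per second.
    """
    deltas = np.diff(centers, axis=0)               # (T-1, 3) per-step displacements
    step_lengths = np.linalg.norm(deltas, axis=1)   # length of each step in meters
    traveled_distance = step_lengths.sum()          # path length, not net displacement
    duration = (len(centers) - 1) / fps             # elapsed time in seconds
    avg_speed = traveled_distance / duration        # meters per second
    net = centers[-1] - centers[0]                  # overall displacement vector
    direction = net / (np.linalg.norm(net) + 1e-8)  # unit movement direction
    return traveled_distance, avg_speed, direction

if __name__ == "__main__":
    # Toy trajectory: an object moving roughly along +x at ~2 m/s, sampled at 10 fps.
    t = np.arange(21)
    centers = np.stack([0.2 * t, 0.01 * t, np.zeros_like(t)], axis=1).astype(float)
    dist, speed, direction = kinematics_from_trajectory(centers, fps=10.0)
    print(f"traveled {dist:.2f} m at {speed:.2f} m/s, heading {direction.round(2)}")
```

The same quantities computed for two objects would support the dataset's comparative questions, e.g., which object traveled farther or whether two objects move in similar directions.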