

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

March 25, 2025
作者: Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim
cs.AI

Abstract
Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
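As a rough illustration of the kinematic quantities annotated in STKit (this is a minimal sketch, not the authors' actual annotation pipeline; the function name and input format are hypothetical), traveled distance and average speed can be derived from a metrically scaled 3D trajectory:

```python
import math

def kinematics_from_track(positions, fps):
    """Compute traveled distance (m) and average speed (m/s)
    from per-frame 3D positions in real-world (metric) scale."""
    # Sum Euclidean distances between consecutive position samples.
    dist = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    duration = (len(positions) - 1) / fps  # elapsed time in seconds
    speed = dist / duration if duration > 0 else 0.0
    return dist, speed

# Example: an object moving 1 m per frame along the z-axis at 10 fps.
track = [(0.0, 0.0, float(i)) for i in range(6)]
d, v = kinematics_from_track(track, fps=10)
# d = 5.0 m over 0.5 s, so v = 10.0 m/s
```

The same trajectory representation also supports the benchmark's comparative questions (e.g., which of two objects traveled farther) by computing these quantities per object and comparing them.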

