ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Abstract

Spatio-temporal reasoning is essential for understanding real-world environments in fields such as autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
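
To make the kinematic quantities named above concrete, the following is a minimal sketch of how traveled distance and average speed could be derived from a 3D object track. It is an illustration only, not the authors' annotation pipeline: the function names, the track format (a list of metric 3D positions with per-frame timestamps in seconds), and the sample values are all assumptions.

```python
import math


def traveled_distance(points):
    """Path length in meters: sum of distances between consecutive 3D positions.

    `points` is a hypothetical track format: a list of (x, y, z) tuples in meters.
    """
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def average_speed(points, timestamps):
    """Average speed in m/s: path length divided by elapsed time.

    `timestamps` holds one time in seconds per position (assumed format).
    """
    if len(points) != len(timestamps):
        raise ValueError("need exactly one timestamp per position")
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        raise ValueError("timestamps must span a positive duration")
    return traveled_distance(points) / duration


# Made-up example track: three positions sampled one second apart.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
times = [0.0, 1.0, 2.0]

print(f"distance: {traveled_distance(track):.1f} m")
print(f"speed: {average_speed(track, times):.1f} m/s")
```

Path length is summed segment by segment rather than taken as the start-to-end displacement, so an object that doubles back still accumulates distance.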
