ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Abstract

Spatio-temporal reasoning is essential for understanding real-world environments in fields such as autonomous driving and sports analytics. Recent work has improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark for kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
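
As an illustration of two of the kinematic quantities named in the abstract, traveled distance and average speed can be derived from a per-frame sequence of 3D object centers. The sketch below is not the STKit pipeline: the function names, the track format (one `(x, y, z)` position in meters per frame), and the frame rate are assumptions made only for this example.

```python
import math

def traveled_distance(positions):
    """Sum the Euclidean distances between consecutive 3D positions (meters)."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))

def average_speed(positions, fps):
    """Divide traveled distance by elapsed time to get speed in m/s.

    Assumes one position per frame at a constant frame rate `fps`.
    """
    if len(positions) < 2:
        return 0.0  # a single frame carries no motion information
    elapsed_seconds = (len(positions) - 1) / fps
    return traveled_distance(positions) / elapsed_seconds

# Hypothetical track: three frames of an object's 3D center, sampled at 2 fps.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
distance_m = traveled_distance(track)     # 5 m + 12 m = 17 m
speed_mps = average_speed(track, fps=2)   # 17 m over 1 s = 17 m/s
```

Summing per-frame segments, rather than taking the straight-line distance between the first and last positions, measures the path actually traveled, which is the quantity a question about "traveled distance" refers to.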
