From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
December 4, 2025
Authors: Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari
cs.AI
Abstract
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video domains, such as sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD scenes. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, a systematic evaluation is performed covering 9 closed- and open-source generalist models as well as SoTA AD specialist models. On TAD, current SoTA models achieve substandard accuracy, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) prompting, and TCogMap, which incorporates an ego-centric temporal cognitive map. When integrated with existing VLMs, the proposed approaches improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at https://huggingface.co/datasets/vbdai/TAD and https://github.com/vbdi/tad_bench, respectively.
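As a starting point for working with the benchmark, the sketch below loads TAD from Hugging Face and scores exact-match predictions over its QA pairs. The dataset id is taken from the link above; the split name, the field names ("question", "answer"), and the illustrative CoT prompt are assumptions for illustration, not the paper's actual schema or the Scene-CoT prompt.

```python
# Minimal sketch: load the TAD benchmark and score multiple-choice predictions.
# Assumptions (not stated in the abstract): the split name, and that each
# example exposes "question" and "answer" fields -- check the dataset card.
from datasets import load_dataset

tad = load_dataset("vbdai/TAD")   # dataset id from the abstract's link
split = tad[next(iter(tad))]      # take the first available split

# Illustrative Chain-of-Thought wrapper in the spirit of Scene-CoT; the
# paper's actual prompt is not given in the abstract.
COT_TEMPLATE = (
    "You are analyzing ego-centric driving footage.\n"
    "First describe the scene and the temporal order of the observed actions, "
    "then answer the question.\n\n"
    "Question: {question}\nAnswer:"
)

def accuracy(predictions, references):
    """Fraction of exact-match answers over the QA pairs."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

A VLM under evaluation would receive COT_TEMPLATE.format(question=ex["question"]) alongside the video frames, and its parsed answers would be compared against the reference answers with the accuracy function.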