
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

December 4, 2025
作者: Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari
cs.AI

Abstract

Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these emphasize other video domains, such as sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, an evaluation is performed covering 9 closed- and open-source generalist models as well as SoTA AD specialist models. On TAD, current SoTA models achieve substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches integrate with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
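To make the benchmark concrete, below is a minimal sketch of loading TAD from Hugging Face and wrapping one of its questions in a Scene-CoT-style prompt. The split name and QA field names ("question", "options") are assumptions about the dataset schema (check the dataset card for the actual layout), and the prompt template only mimics the general idea of reasoning about the scene before answering; it is not the paper's actual Scene-CoT method.

```python
# Minimal sketch: load the TAD benchmark and build a CoT-style prompt.
# Field names ("question", "options") and the "test" split are assumptions;
# see https://huggingface.co/datasets/vbdai/TAD for the real schema.
from datasets import load_dataset

tad = load_dataset("vbdai/TAD", split="test")  # split name is an assumption

def scene_cot_prompt(question: str, options: list[str]) -> str:
    """Wrap a TAD question in a generic Chain-of-Thought template.

    This mimics the *idea* of Scene-CoT (describe scene dynamics first,
    then answer); it is not the paper's exact prompting strategy.
    """
    choices = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        "You are analyzing an ego-centric driving video.\n"
        "First, describe the motion of the ego vehicle and surrounding agents "
        "over time. Then answer the question.\n\n"
        f"Question: {question}\n{choices}\n"
        "Think step by step, then give the final option letter."
    )

sample = tad[0]
print(scene_cot_prompt(sample["question"], sample["options"]))
```

The rendered prompt would then be passed, together with the video frames, to whichever VLM is being evaluated.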
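Similarly, here is a speculative sketch of what an ego-centric temporal cognitive map could look like as a data structure: per-timestep observations of surrounding agents relative to the ego vehicle, rendered as text that can be prepended to a VLM prompt. All class and field names are illustrative assumptions; the paper's TCogMap may be organized quite differently.

```python
# Speculative sketch of an ego-centric temporal cognitive map: a per-timestep
# record of agents relative to the ego vehicle, serialized as text for a VLM.
# Structure and field names are illustrative, not the paper's TCogMap.
from dataclasses import dataclass, field

@dataclass
class AgentObservation:
    agent_id: str
    relative_position: tuple[float, float]  # (lateral, longitudinal) metres from ego
    action: str                             # e.g. "braking", "lane change left"

@dataclass
class TemporalCognitiveMap:
    frames: dict[int, list[AgentObservation]] = field(default_factory=dict)

    def add(self, t: int, obs: AgentObservation) -> None:
        self.frames.setdefault(t, []).append(obs)

    def to_prompt(self) -> str:
        """Render the map as time-ordered text a VLM can condition on."""
        lines = []
        for t in sorted(self.frames):
            for o in self.frames[t]:
                lines.append(
                    f"t={t}s: {o.agent_id} at ({o.relative_position[0]:+.1f}, "
                    f"{o.relative_position[1]:+.1f}) m, {o.action}"
                )
        return "\n".join(lines)

tmap = TemporalCognitiveMap()
tmap.add(0, AgentObservation("car_1", (1.5, 12.0), "cruising"))
tmap.add(2, AgentObservation("car_1", (1.5, 6.0), "braking"))
print(tmap.to_prompt())
```

The design intuition, as described in the abstract, is that explicitly tracking agent motion over time compensates for the fine-grained motion understanding that current VLMs lack.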