LITA: Language Instructed Temporal-Localization Assistant
March 27, 2024
Authors: De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
cs.AI
Abstract
There has been tremendous progress in multimodal Large Language Models
(LLMs). Recent works have extended these models to video input with promising
instruction following capabilities. However, an important missing piece is
temporal localization. These models cannot accurately answer the "When?"
questions. We identify three key aspects that limit their temporal localization
capabilities: (i) time representation, (ii) architecture, and (iii) data. We
address these shortcomings by proposing Language Instructed
Temporal-Localization Assistant (LITA) with the following features: (1) We
introduce time tokens that encode timestamps relative to the video length to
better represent time in videos. (2) We introduce SlowFast tokens in the
architecture to capture temporal information at fine temporal resolution. (3)
We emphasize temporal localization data for LITA. In addition to leveraging
existing video datasets with timestamps, we propose a new task, Reasoning
Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for
learning and evaluating this task. Reasoning temporal localization requires
both the reasoning and temporal localization of Video LLMs. LITA demonstrates
strong performance on this challenging task, nearly doubling the temporal mean
intersection-over-union (mIoU) of baselines. In addition, we show that our
emphasis on temporal localization also substantially improves video-based text
generation compared to existing Video LLMs, including a 36% relative
improvement of Temporal Understanding. Code is available at:
https://github.com/NVlabs/LITA
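
As a rough illustration of two quantities mentioned in the abstract, the sketch below shows one way a timestamp could be encoded as a time token relative to the video length, together with the temporal intersection-over-union (IoU) used to score predicted segments. The token vocabulary size, rounding scheme, and function names are assumptions made for illustration and are not taken from the LITA implementation.

```python
# Hedged sketch: relative time tokens and temporal IoU.
# NUM_TIME_TOKENS and the rounding scheme are illustrative assumptions,
# not the exact choices made in the LITA paper or repository.

NUM_TIME_TOKENS = 100  # hypothetical vocabulary size, e.g. <t_0> ... <t_99>


def timestamp_to_token(timestamp_sec: float, video_length_sec: float) -> int:
    """Map an absolute timestamp to a time-token index relative to video length."""
    frac = min(max(timestamp_sec / video_length_sec, 0.0), 1.0)
    return round(frac * (NUM_TIME_TOKENS - 1))


def token_to_timestamp(token_idx: int, video_length_sec: float) -> float:
    """Map a time-token index back to an approximate absolute timestamp."""
    return token_idx / (NUM_TIME_TOKENS - 1) * video_length_sec


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # In a 120 s video, 45 s maps to roughly token 37 under this scheme.
    tok = timestamp_to_token(45.0, 120.0)
    print(tok, token_to_timestamp(tok, 120.0))
    print(temporal_iou((40.0, 60.0), (45.0, 70.0)))  # 0.5
```

Encoding time relative to the video length keeps the time-token vocabulary fixed regardless of video duration, which is the motivation the abstract gives for this representation.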
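Similarly, the minimal sketch below shows how SlowFast-style tokens might trade temporal against spatial resolution, assuming per-frame features from a visual encoder. The sampling stride, pooling choices, and function names are illustrative assumptions and may differ from LITA's actual architecture.

```python
# Hedged sketch of a SlowFast-style visual token layout. The stride and
# pooling choices are illustrative, not LITA's exact configuration.
import numpy as np


def slowfast_tokens(frame_feats: np.ndarray, slow_stride: int = 4) -> np.ndarray:
    """Build a combined token sequence from densely sampled frame features.

    frame_feats: array of shape (T, N, D) with T frames and N spatial tokens each.
    Fast tokens: one spatially averaged token per frame (fine temporal resolution).
    Slow tokens: all N spatial tokens from every `slow_stride`-th frame
                 (coarse temporal, fine spatial resolution).
    """
    t, n, d = frame_feats.shape
    fast = frame_feats.mean(axis=1)                    # (T, D)
    slow = frame_feats[::slow_stride].reshape(-1, d)   # (T // slow_stride * N, D)
    return np.concatenate([fast, slow], axis=0)        # tokens fed to the LLM


if __name__ == "__main__":
    feats = np.random.randn(100, 256, 1024).astype(np.float32)  # 100 frames
    print(slowfast_tokens(feats).shape)  # (100 + 25 * 256, 1024) = (6500, 1024)
```

The point of the layout is that the fast stream keeps one token per sampled frame so fine-grained timing is preserved, while the slow stream keeps richer spatial detail for only a subset of frames, keeping the total token count manageable for the LLM.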