LITA: Language Instructed Temporal-Localization Assistant
March 27, 2024
Authors: De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
cs.AI
Abstract
There has been tremendous progress in multimodal Large Language Models
(LLMs). Recent works have extended these models to video input with promising
instruction following capabilities. However, an important missing piece is
temporal localization. These models cannot accurately answer the "When?"
questions. We identify three key aspects that limit their temporal localization
capabilities: (i) time representation, (ii) architecture, and (iii) data. We
address these shortcomings by proposing Language Instructed
Temporal-Localization Assistant (LITA) with the following features: (1) We
introduce time tokens that encode timestamps relative to the video length to
better represent time in videos. (2) We introduce SlowFast tokens in the
architecture to capture temporal information at fine temporal resolution. (3)
We emphasize temporal localization data for LITA. In addition to leveraging
existing video datasets with timestamps, we propose a new task, Reasoning
Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for
learning and evaluating this task. Reasoning temporal localization requires
both the reasoning and temporal localization of Video LLMs. LITA demonstrates
strong performance on this challenging task, nearly doubling the temporal mean
intersection-over-union (mIoU) of baselines. In addition, we show that our
emphasis on temporal localization also substantially improves video-based text
generation compared to existing Video LLMs, including a 36% relative
improvement of Temporal Understanding. Code is available at:
https://github.com/NVlabs/LITA
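
As a rough illustration of two quantities mentioned in the abstract, the sketch below shows one way a timestamp could be encoded as a time token relative to the video length, together with the temporal intersection-over-union (IoU) used to score predicted segments. The token vocabulary size, rounding scheme, and function names are assumptions made for illustration and are not taken from the LITA implementation.

```python
# Hedged sketch: relative time tokens and temporal IoU.
# NUM_TIME_TOKENS and the rounding scheme are illustrative assumptions,
# not the exact choices made in the LITA paper or repository.

NUM_TIME_TOKENS = 100  # hypothetical vocabulary size, e.g. <t_0> ... <t_99>


def timestamp_to_token(timestamp_sec: float, video_length_sec: float) -> int:
    """Map an absolute timestamp to a time-token index relative to video length."""
    frac = min(max(timestamp_sec / video_length_sec, 0.0), 1.0)
    return round(frac * (NUM_TIME_TOKENS - 1))


def token_to_timestamp(token_idx: int, video_length_sec: float) -> float:
    """Map a time-token index back to an approximate absolute timestamp."""
    return token_idx / (NUM_TIME_TOKENS - 1) * video_length_sec


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # In a 120 s video, 45 s maps to roughly token 37 under this scheme.
    tok = timestamp_to_token(45.0, 120.0)
    print(tok, token_to_timestamp(tok, 120.0))
    print(temporal_iou((40.0, 60.0), (45.0, 70.0)))  # 0.5
```

Encoding time relative to the video length keeps the time-token vocabulary fixed regardless of video duration, which is the motivation the abstract gives for this representation.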
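Similarly, the minimal sketch below shows how SlowFast-style tokens might trade temporal against spatial resolution, assuming per-frame features from a visual encoder. The sampling stride, pooling choices, and function names are illustrative assumptions and may differ from LITA's actual architecture.

```python
# Hedged sketch of a SlowFast-style visual token layout. The stride and
# pooling choices are illustrative, not LITA's exact configuration.
import numpy as np


def slowfast_tokens(frame_feats: np.ndarray, slow_stride: int = 4) -> np.ndarray:
    """Build a combined token sequence from densely sampled frame features.

    frame_feats: array of shape (T, N, D) with T frames and N spatial tokens each.
    Fast tokens: one spatially averaged token per frame (fine temporal resolution).
    Slow tokens: all N spatial tokens from every `slow_stride`-th frame
                 (coarse temporal, fine spatial resolution).
    """
    t, n, d = frame_feats.shape
    fast = frame_feats.mean(axis=1)                    # (T, D)
    slow = frame_feats[::slow_stride].reshape(-1, d)   # (T // slow_stride * N, D)
    return np.concatenate([fast, slow], axis=0)        # tokens fed to the LLM


if __name__ == "__main__":
    feats = np.random.randn(100, 256, 1024).astype(np.float32)  # 100 frames
    print(slowfast_tokens(feats).shape)  # (100 + 25 * 256, 1024) = (6500, 1024)
```

The point of the layout is that the fast stream keeps one token per sampled frame so fine-grained timing is preserved, while the slow stream keeps richer spatial detail for only a subset of frames, keeping the total token count manageable for the LLM.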