

LITA: Language Instructed Temporal-Localization Assistant

March 27, 2024
Authors: De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
cs.AI

Abstract

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement of Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
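The abstract highlights two technical ideas: encoding timestamps as time tokens relative to the video length, and evaluating localization with temporal mean intersection-over-union (mIoU). The sketch below illustrates both under assumptions that are not taken from the paper: a hypothetical vocabulary of 100 time tokens and simple uniform binning. The authors' actual implementation (see the GitHub repository above) may differ.

```python
# Minimal sketch of (a) quantizing timestamps into time tokens expressed
# relative to the video length and (b) temporal IoU between predicted and
# ground-truth intervals. NUM_TIME_TOKENS and the binning scheme are
# illustrative assumptions, not the paper's exact choices.

NUM_TIME_TOKENS = 100  # hypothetical number of discrete time tokens


def timestamp_to_token(t_sec: float, video_len_sec: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> int:
    """Map an absolute timestamp (seconds) to a time-token index in
    [0, num_tokens - 1], i.e. time expressed relative to video length."""
    frac = min(max(t_sec / video_len_sec, 0.0), 1.0)
    return min(int(frac * num_tokens), num_tokens - 1)


def token_to_timestamp(token: int, video_len_sec: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> float:
    """Map a time-token index back to seconds (center of its time bin)."""
    return (token + 0.5) / num_tokens * video_len_sec


def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds;
    temporal mIoU averages this value over a set of queries."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A 120-second video with an event from 30 s to 45 s.
    start_tok = timestamp_to_token(30.0, 120.0)   # -> 25
    end_tok = timestamp_to_token(45.0, 120.0)     # -> 37
    pred = (token_to_timestamp(start_tok, 120.0),
            token_to_timestamp(end_tok, 120.0))
    print(start_tok, end_tok, round(temporal_iou(pred, (30.0, 45.0)), 2))
```

Expressing time as a fraction of the video length keeps the time-token vocabulary fixed regardless of video duration, which is the motivation the abstract gives for this representation.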
