LITA: Language Instructed Temporal-Localization Assistant

Abstract

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing the Language Instructed Temporal-Localization Assistant (LITA), which has the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with a dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires Video LLMs to both reason and localize in time. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
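To make two ideas from the abstract concrete, the following is a minimal Python sketch of (a) mapping a timestamp to a time token that is relative to the video length and (b) computing temporal IoU and its mean over a set of predictions. It is an illustration under stated assumptions, not the code from the LITA repository: the function names, the default of 100 time tokens, and the rounding rule are all assumptions chosen for the example, since the abstract does not specify them.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def timestamp_to_token(t_sec: float, duration_sec: float, num_tokens: int = 100) -> int:
    """Map an absolute timestamp to a 1-based relative time-token index.

    Assumption: the video is split into `num_tokens` evenly spaced positions,
    so token 1 is the start of the video and token `num_tokens` is its end.
    """
    if duration_sec <= 0 or num_tokens < 2:
        raise ValueError("duration_sec must be > 0 and num_tokens must be >= 2")
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)  # clamp to [0, 1]
    return round(frac * (num_tokens - 1)) + 1


def token_to_timestamp(token: int, duration_sec: float, num_tokens: int = 100) -> float:
    """Inverse mapping: recover an approximate timestamp from a time token."""
    return duration_sec * (token - 1) / (num_tokens - 1)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals on the same video."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Because the tokens are relative to the video length, the same token denotes the same fraction of any video regardless of its duration; the cost is a quantization error of at most half a token step when converting back to seconds.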
