LITA: Language Instructed Temporal-Localization Assistant

Abstract

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing the Language Instructed Temporal-Localization Assistant (LITA), which has the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with a dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires Video LLMs to both reason about a video and localize the answer in time. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
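
To make two ideas from the abstract concrete, the following is a minimal sketch of (a) a time token that encodes a timestamp relative to the video length and (b) the temporal IoU that underlies the mIoU metric. It is an illustration under stated assumptions, not the released LITA code: the function names, the default of 100 tokens, and the rounding rule are hypothetical choices that the abstract does not specify.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def encode_time_token(timestamp_s: float, video_length_s: float,
                      num_tokens: int = 100) -> int:
    """Map an absolute timestamp to a relative token index in [1, num_tokens].

    Assumption: tokens are spread uniformly over the video, so index 1 is the
    start and index num_tokens is the end, whatever the video length.
    """
    if video_length_s <= 0 or num_tokens < 2:
        raise ValueError("video_length_s must be > 0 and num_tokens >= 2")
    # Clamp to [0, 1] so out-of-range timestamps still map to a valid token.
    fraction = min(max(timestamp_s / video_length_s, 0.0), 1.0)
    return round(fraction * (num_tokens - 1)) + 1


def decode_time_token(index: int, video_length_s: float,
                      num_tokens: int = 100) -> float:
    """Invert encode_time_token, up to quantization error."""
    return video_length_s * (index - 1) / (num_tokens - 1)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Because the token index depends only on the fraction of the video that has elapsed, the same small vocabulary covers videos of any duration. The cost is quantization: with this scheme the decoded timestamp can differ from the original by up to half a token step, which grows with video length.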
