Text-Conditioned Resampler For Long Form Video Understanding
December 19, 2023
Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
cs.AI
Abstract
Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve any given task. In this paper, we present a
text-conditioned video resampler (TCR) module that uses a pre-trained and
frozen visual encoder and large language model (LLM) to process long video
sequences for a task. TCR localises relevant visual features from the video
given a text condition and provides them to an LLM to generate a text response.
Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video
than earlier works. We make the following contributions: (i) we design a
transformer-based sampling architecture that can process long videos
conditioned on a task, together with a training method that enables it to
bridge pre-trained visual and language models; (ii) we empirically validate its
efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art
on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we identify tasks that require longer video contexts and can thus be used effectively for further evaluation of long-range video models.
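The abstract describes TCR as a lightweight cross-attention module that compresses long sequences of frozen visual features into a small set of text-conditioned tokens for the LLM. Below is a minimal PyTorch sketch of what such a text-conditioned resampler could look like. The module name, dimensions, query count, and the concatenation-based text conditioning are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a text-conditioned resampler (Perceiver-style).
# All names, dimensions, and the exact conditioning scheme are
# illustrative assumptions -- NOT the paper's actual implementation.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim=768, num_queries=64, num_layers=4, num_heads=12):
        super().__init__()
        # A small, fixed set of learnable queries; their count (not the
        # number of input frames) determines how many tokens reach the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "self_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "mlp": nn.Sequential(
                    nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
                ),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
                "norm3": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T*P, dim) frozen visual-encoder features for
        #              100+ frames, flattened over time and patches.
        # text_feats:  (B, L, dim)   embedded text condition (the task).
        B = video_feats.size(0)
        x = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Prepend the text condition to the keys/values so the queries
        # can localise task-relevant moments in the long video sequence.
        kv = torch.cat([text_feats, video_feats], dim=1)
        for layer in self.layers:
            q = layer["norm1"](x)
            x = x + layer["cross_attn"](q, kv, kv, need_weights=False)[0]
            q = layer["norm2"](x)
            x = x + layer["self_attn"](q, q, q, need_weights=False)[0]
            x = x + layer["mlp"](layer["norm3"](x))
        return x  # (B, num_queries, dim): fixed-size summary fed to the LLM
```

The point of this design is that compute and the LLM's input length scale with the fixed number of learnable queries rather than with the number of frames: cross-attention over the (text + video) sequence is what lets a module like this ingest more than 100 frames at a time while handing the frozen LLM only a short, task-relevant token sequence.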