

Text-Conditioned Resampler For Long Form Video Understanding

December 19, 2023
Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
cs.AI

Abstract

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve a given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state of the art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we identify tasks that require longer video contexts and can therefore be used effectively for further evaluation of long-range video models.
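To make the mechanism concrete, below is a minimal PyTorch sketch of a text-conditioned resampler of the kind the abstract describes: a small set of learnable queries is conditioned on the text and cross-attends to frozen per-frame visual features, producing a fixed number of tokens for a frozen LLM. All names, dimensions, and layer choices (TextConditionedResampler, num_queries, the joint query/text self-attention layout) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only, reconstructed from the abstract; names,
# dimensions, and the block layout are assumptions, not the paper's code.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, dim=512,
                 num_queries=64, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable queries: the video gets compressed into this
        # fixed-size set of tokens for the (frozen) LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, dim)  # project frozen visual features
        self.txt_proj = nn.Linear(txt_dim, dim)  # project text-condition embeddings
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                # Cross-attention from queries to frame tokens: cost is
                # linear in video length, so 100+ frames stay affordable.
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "self_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
                "n1": nn.LayerNorm(dim), "n2": nn.LayerNorm(dim),
                "n3": nn.LayerNorm(dim),
            }) for _ in range(num_layers)
        ])

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T*P, vis_dim) frozen visual tokens from T frames
        # txt_feats: (B, L, txt_dim) embeddings of the text condition (task/question)
        B = vis_feats.size(0)
        kv = self.vis_proj(vis_feats)
        txt = self.txt_proj(txt_feats)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        for blk in self.layers:
            # Condition queries on the text by attending over [queries; text].
            x = torch.cat([q, txt], dim=1)
            q = q + blk["self_attn"](blk["n1"](q), blk["n1"](x), blk["n1"](x))[0]
            # Gather task-relevant visual evidence from the long video.
            q = q + blk["cross_attn"](blk["n2"](q), kv, kv)[0]
            q = q + blk["ffn"](blk["n3"](q))
        return q  # (B, num_queries, dim): fixed-size tokens handed to the LLM
```

Because the queries attend to the frame tokens via cross-attention rather than full self-attention over all frames, compute grows linearly with video length; this is the property that lets a module of this kind ingest much longer video chunks than approaches that feed every frame token directly to the LLM.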