Text-Conditioned Resampler For Long Form Video Understanding

Abstract

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve any given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks that require longer video contexts and can thus be used effectively for further evaluation of long-range video models.
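To make the resampling idea concrete, the following is a minimal sketch of a cross-attention step in which a small, fixed set of queries reads from an arbitrarily long sequence of frame features. It is not the TCR implementation: the function names, the dimensions, the random stand-in features, and in particular the way the text condition is injected (adding the mean text embedding to each query) are all assumptions made for illustration, since the abstract does not specify these details. Learned projections and multi-head attention are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frame_features):
    """Single-head scaled dot-product cross-attention.

    Each query attends over all frame features, so the output has one
    vector per query regardless of how many frames are supplied.
    """
    dim = len(frame_features[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in frame_features]
        weights = softmax(scores)
        outputs.append([sum(w * f[i] for w, f in zip(weights, frame_features))
                        for i in range(dim)])
    return outputs


def resample(frame_features, text_tokens, learned_queries):
    """Hypothetical text-conditioned resampling step.

    Assumption: the text condition is injected by adding the mean text
    embedding to every query. The abstract does not describe the actual
    conditioning mechanism.
    """
    dim = len(learned_queries[0])
    text_mean = [sum(t[i] for t in text_tokens) / len(text_tokens)
                 for i in range(dim)]
    conditioned = [[q[i] + text_mean[i] for i in range(dim)]
                   for q in learned_queries]
    return cross_attend(conditioned, frame_features)


def random_vectors(count, dim, rng):
    """Random placeholder embeddings used only for this illustration."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4
frames = random_vectors(120, DIM, rng)           # stand-in for frozen encoder output
text = random_vectors(6, DIM, rng)               # stand-in for the embedded text condition
queries = random_vectors(NUM_QUERIES, DIM, rng)  # stand-in for learned queries

tokens_for_llm = resample(frames, text, queries)
print(len(tokens_for_llm), len(tokens_for_llm[0]))
```

In this sketch the number of vectors handed to the language model is set by the number of queries, not by the number of frames, which is the property that would let a module of this kind take in more than 100 frames while keeping the LLM input short.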
