Text-Conditioned Resampler For Long Form Video Understanding

Abstract

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve a given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state of the art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we identify tasks that require longer video contexts and can thus be used effectively for further evaluation of long-range video models.
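To make the role of cross-attention concrete, the following is a minimal sketch of text-conditioned resampling in plain Python. It is an illustration under stated assumptions, not the TCR architecture itself: the function names (`cross_attend`, `resample`), the conditioning scheme (adding a single text embedding to each query), the single attention head, and the omission of learned projections are all hypothetical simplifications. It shows only the general property the abstract relies on: a small, fixed set of queries attends over all frame features, so the number of tokens handed to the LLM does not grow with the number of frames.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Single-head scaled dot-product cross-attention.

    Each query yields a weighted average of `features`, so the output
    length depends on the number of queries, not on the video length.
    Learned key/value/query projections are omitted (assumption).
    """
    scale = 1.0 / math.sqrt(len(features[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, f) * scale for f in features])
        out = [sum(wi * f[d] for wi, f in zip(w, features)) for d in range(len(q))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def resample(frame_features, text_embedding, learned_queries):
    """Condition the queries on text, then attend over all frames.

    Adding the text embedding to each query is a hypothetical
    conditioning scheme chosen for brevity.
    """
    conditioned = [
        [q_d + t_d for q_d, t_d in zip(q, text_embedding)]
        for q in learned_queries
    ]
    return cross_attend(conditioned, frame_features)


# Toy data: 120 frame features of dimension 8, resampled to 4 tokens.
random.seed(0)
DIM, NUM_FRAMES, NUM_QUERIES = 8, 120, 4
frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
text = [random.gauss(0, 1) for _ in range(DIM)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

tokens, attn = resample(frames, text, queries)
print(len(tokens), len(tokens[0]))  # 4 8
```

In this toy setting, 120 frame features are reduced to 4 output tokens. Doubling the number of frames would leave the output size unchanged and only lengthen each attention row, which is why a design of this kind can remain lightweight on long inputs.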

