Multimodal Long Video Modeling Based on Temporal Dynamic Context

Abstract

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long videos because of the context length constraint of LLMs and the vast amount of information a video contains. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose Temporal Dynamic Context (TDC), a dynamic long video encoding method that exploits the temporal relationships between frames. First, we segment the video into semantically consistent scenes based on inter-frame similarities and encode each frame into tokens using visual-audio encoders. Second, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments; these intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
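
The first step of the abstract, splitting a video into scenes based on inter-frame similarities, can be illustrated with a minimal sketch. This is not the released TDC code: the abstract does not say which similarity measure or cut rule is used, so the cosine similarity, the fixed `threshold`, and the names `cosine_similarity` and `segment_by_similarity` are all assumptions made for illustration. Each frame is represented here by a toy feature vector, and a new scene starts whenever two consecutive frames fall below the threshold.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_by_similarity(
    features: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Group consecutive frames into scenes.

    Returns (start, end) frame-index pairs with `end` exclusive. A cut is
    placed between frames i-1 and i when their similarity drops below
    `threshold` (an assumed rule, not taken from the paper).
    """
    if not features:
        return []
    segments = []
    start = 0
    for i in range(1, len(features)):
        if cosine_similarity(features[i - 1], features[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments


# Toy per-frame features: two similar frames, then three frames of a new scene.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
scenes = segment_by_similarity(frames, threshold=0.8)
print(scenes)  # [(0, 2), (2, 5)]
```

In a real pipeline the feature vectors would come from a visual encoder rather than hand-written lists, and the tokens of each resulting segment would then be passed to the temporal context compressor described above.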
