LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge because it is constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. With a lightweight LLM, LongVU also scales effectively to a smaller size while retaining state-of-the-art video understanding performance.
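
The first compression step, removing redundant frames that exhibit high feature similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy comparison against the last kept frame, and the threshold of 0.9 are all assumptions made for illustration, and the short hand-written vectors stand in for real DINOv2 features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def prune_redundant_frames(frame_features, threshold=0.9):
    """Return indices of frames to keep.

    Hypothetical greedy rule (an assumption, not the paper's procedure):
    a frame is dropped when its similarity to the most recently kept
    frame is at or above `threshold`.
    """
    kept = []
    for index, feature in enumerate(frame_features):
        if not kept or cosine_similarity(frame_features[kept[-1]], feature) < threshold:
            kept.append(index)
    return kept


# Toy per-frame features standing in for DINOv2 embeddings.
features = [
    [1.0, 0.0],    # frame 0: kept (first frame)
    [0.99, 0.05],  # frame 1: nearly identical to frame 0
    [0.0, 1.0],    # frame 2: new content
    [0.02, 1.0],   # frame 3: nearly identical to frame 2
    [-1.0, 0.0],   # frame 4: new content
]
kept_indices = prune_redundant_frames(features, threshold=0.9)
print(kept_indices)
```

In this toy input, frames 1 and 3 are near-duplicates of their predecessors and are dropped. A real pipeline would compute the features with a vision encoder and would still need the text-guided and spatial reduction steps described in the abstract, which this sketch does not cover.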
