Visual Context Window Extension: A New Perspective for Long Video Understanding

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance on short video understanding tasks but face significant challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
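To make the idea of selectively adjusting the spatial resolution of frame embeddings concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the pooling operator or schedule, so everything here is an assumption for illustration: the use of average pooling, the grouping of frames, the choice to keep the first frame of each group at finer resolution, and the names `avg_pool`, `pool_video`, `group_size`, `fine_stride`, and `coarse_stride`.

```python
from typing import List

# One frame embedding: an H x W grid of D-dimensional token vectors.
Grid = List[List[List[float]]]


def avg_pool(grid: Grid, stride: int) -> Grid:
    """Average-pool an H x W token grid with a square, non-overlapping window."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for top in range(0, h, stride):
        row = []
        for left in range(0, w, stride):
            # Collect the token vectors that fall inside this window.
            cells = [
                grid[y][x]
                for y in range(top, min(top + stride, h))
                for x in range(left, min(left + stride, w))
            ]
            # Mean over the window, one value per embedding dimension.
            row.append([sum(c[k] for c in cells) / len(cells) for k in range(d)])
        pooled.append(row)
    return pooled


def pool_video(
    frames: List[Grid], group_size: int, fine_stride: int, coarse_stride: int
) -> List[Grid]:
    """Hypothetical schedule (an assumption, not taken from the abstract):
    the first frame of each group is pooled lightly, the others more heavily."""
    return [
        avg_pool(f, fine_stride if i % group_size == 0 else coarse_stride)
        for i, f in enumerate(frames)
    ]


def count_tokens(frames: List[Grid]) -> int:
    """Total number of visual tokens across all frames."""
    return sum(len(f) * len(f[0]) for f in frames)


if __name__ == "__main__":
    # Toy input: 8 frames, each a 4 x 4 grid of 2-dimensional embeddings.
    video = [[[[1.0, 1.0] for _ in range(4)] for _ in range(4)] for _ in range(8)]
    pooled = pool_video(video, group_size=4, fine_stride=1, coarse_stride=2)
    print(count_tokens(video), "->", count_tokens(pooled))
```

In this toy setting, two frames keep all 16 tokens and six frames are reduced to 4 tokens each, so the sequence shrinks from 128 to 56 visual tokens. The figures reported in the abstract (such as the roughly 45% memory reduction at 256 frames) come from the authors' own configuration, which this sketch does not attempt to reproduce.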
