Token-Efficient Long Video Understanding for Multimodal LLMs

Abstract

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone and lack explicit temporal modeling, which limits their ability to capture dynamic patterns and to handle long videos efficiently. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, which substantially reduce the computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to 8× and decoding latency by 2.4-2.9× for a fixed number of input frames. The project page is available at https://research.nvidia.com/labs/lpr/storm
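To make the idea of temporal pooling as token reduction concrete, the following is a minimal sketch, not the authors' implementation. It assumes each frame is a list of token feature vectors and simply averages groups of consecutive frames token by token. The Mamba temporal encoder that would run before pooling is omitted, and the function names, the window size of 4, and the toy dimensions are all hypothetical choices made for illustration.

```python
def temporal_average_pool(frames, window):
    """Average each group of `window` consecutive frames, token by token.

    `frames` is a list of frames; each frame is a list of tokens;
    each token is a list of floats. This layout is an assumption.
    """
    pooled = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        num_tokens = len(group[0])
        dim = len(group[0][0])
        # One merged frame per group: mean over the frames in the group.
        merged = [
            [sum(frame[t][d] for frame in group) / len(group) for d in range(dim)]
            for t in range(num_tokens)
        ]
        pooled.append(merged)
    return pooled


def count_tokens(frames):
    """Total number of tokens that would be passed on to the LLM."""
    return sum(len(frame) for frame in frames)


# Toy video: 8 frames, 4 tokens per frame, 2-dimensional features.
video = [[[float(f), float(t)] for t in range(4)] for f in range(8)]
pooled = temporal_average_pool(video, window=4)

tokens_before = count_tokens(video)
tokens_after = count_tokens(pooled)
```

In this toy setting the token count drops by the pooling window (a factor of 4 here). How such a reduction translates into the compute and latency savings reported in the abstract depends on the actual model and is not something this sketch measures.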
