Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

Abstract

With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic increase in memory and computational demands associated with existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma^2mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows the LMMs to scale linearly in terms of time and memory requirements, making it feasible to handle long-duration video content. Furthermore, we enhance memory efficiency by introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma^2mba can process extensive video sequences, equivalent to millions of tokens or over two hours of continuous sequences at 1 FPS, on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.
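
To make the quadratic-versus-linear contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It counts the entries of a full self-attention score matrix for a two-hour video at 1 FPS and runs a toy scalar linear recurrence of the kind that state space models build on. The helper names `attention_score_entries` and `ssm_scan`, the `decay` and `gain` parameters, and the figure of 196 tokens per frame are all assumptions made for illustration; none of them comes from the abstract, and the recurrence is far simpler than Mamba-2.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in one full self-attention score matrix (tokens x tokens)."""
    return num_tokens * num_tokens


def ssm_scan(inputs, decay=0.9, gain=0.1):
    """Toy scalar linear recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Hypothetical stand-in for an SSM layer. It makes one pass over the
    sequence and carries a single running state, so time grows linearly
    with len(inputs) while the carried state stays constant in size.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


# Illustrative numbers only.
frames = 2 * 60 * 60      # two hours sampled at 1 FPS
tokens_per_frame = 196    # assumed value; the abstract does not state it
num_tokens = frames * tokens_per_frame

print("tokens:", num_tokens)
print("attention score entries:", attention_score_entries(num_tokens))
print("scan steps for the same input:", num_tokens)
print("toy scan output:", ssm_scan([1.0, 0.0, 0.0], decay=0.5, gain=1.0))
```

Under these assumed numbers the token count lands in the "millions of tokens" range the abstract mentions, and the score matrix grows with its square, while the scan performs one step per token.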
