Mavors: マルチモーダル大規模言語モデルのためのマルチ粒度ビデオ表現

要旨

マルチモーダル大規模言語モデル（MLLMs）における長文脈ビデオ理解は、計算効率と細粒度の時空間パターンの保持とのバランスを取るという重要な課題に直面しています。既存のアプローチ（例えば、疎サンプリング、低解像度での密サンプリング、トークン圧縮など）は、特に複雑な動きや解像度が変化するビデオにおいて、時間的ダイナミクス、空間的詳細、または微妙な相互作用において重大な情報損失を被ります。これを解決するために、我々はMavorsという新しいフレームワークを提案します。Mavorsは、ホリスティックな長尺ビデオモデリングのためのマルチグラニュラリティビデオ表現を導入します。具体的には、Mavorsは生のビデオコンテンツを潜在表現に直接エンコードするために、2つのコアコンポーネントを備えています：1）3D畳み込みとVision Transformersを介して高解像度の空間的特徴を保持するIntra-chunk Vision Encoder（IVE）、および2）チャンクレベルロータリーポジションエンコーディングを用いたトランスフォーマーベースの依存関係モデリングにより、チャンク間の時間的整合性を確立するInter-chunk Feature Aggregator（IFA）。さらに、このフレームワークは、画像をサブ画像分解を介して単一フレームのビデオとして扱うことで、画像とビデオの理解を統合します。多様なベンチマークでの実験により、Mavorsが空間的忠実度と時間的連続性の両方を維持する優位性が示され、細粒度の時空間推論を必要とするタスクにおいて既存の手法を大幅に上回ることが実証されました。

English

Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.

Mavors: マルチモーダル大規模言語モデルのためのマルチ粒度ビデオ表現

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

要旨

Support