InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Abstract

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies three self-supervised or weakly supervised learning frameworks: masked video token reconstruction, cross-modal contrastive learning, and next-token prediction. Each training stage uses a different pretext task, guiding the model to capture a different level of structural and semantic information. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. We scale up both the data and the model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
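To make one of the three objectives named above concrete, the following is a minimal sketch of a symmetric cross-modal contrastive loss of the kind commonly used to align video and text embeddings. It is an illustration only and is not taken from the InternVideo2 codebase: the function names, the plain-list embedding format, and the temperature of 0.07 are all assumptions made for this example.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th video is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    # Hypothetical helper: temperature=0.07 is an assumed value, not a
    # setting reported in the abstract.
    v = [_normalize(e) for e in video_embs]
    t = [_normalize(e) for e in text_embs]
    # Pairwise cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_transposed))


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(videos, matched_texts))
    print(symmetric_contrastive_loss(videos, swapped_texts))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are swapped, which is the behavior such a stage relies on to pull matching video and text representations together.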
