InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Abstract

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously building a high-quality video-text dataset with large language models (LLMs), which we use to demonstrate the effectiveness of learning video-language representations at scale. Specifically, we use a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model achieves leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks such as recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
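The abstract says only that ViCLIP is trained on InternVid via contrastive learning; it does not spell out the loss. As a rough illustration, a CLIP-style symmetric InfoNCE objective over a batch of matched video-text pairs could be sketched as below. The function names, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for this sketch, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Pair i of video_embs and text_embs is treated as the positive match;
    every other pairing in the batch acts as a negative.
    The temperature value is an assumed CLIP-style default.
    """
    videos = [l2_normalize(e) for e in video_embs]
    texts = [l2_normalize(e) for e in text_embs]
    n = len(videos)

    # sim[i][j]: temperature-scaled cosine similarity of video i and text j.
    sim = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy batch: matched pairs give a near-zero loss, swapped pairs a large one.
aligned = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
shuffled = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
```

In this toy batch the loss is small when each video embedding points the same way as its own caption embedding and large when the captions are swapped, which is the behavior a contrastive objective of this kind is meant to reward.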
