Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

Abstract

To promote the development of vision-language pre-training (VLP) and multimodal large language models (LLMs) in the Chinese community, we release Youku-mPLUG, the largest public high-quality Chinese video-language dataset. It is collected from Youku, a well-known Chinese video-sharing website, under strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs, filtered from 400 million raw videos across 45 diverse categories, for large-scale pre-training. In addition, to facilitate comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks, covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and to develop better applications in the future. Furthermore, we release two popular video-language pre-training models, ALPRO and mPLUG-2, as well as our proposed modularized decoder-only model mPLUG-video, all pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. In addition, mPLUG-video achieves new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a CIDEr score of 68.9 in video captioning. Finally, we scale up mPLUG-video on top of a frozen Bloomz, with only 1.7% trainable parameters, as a Chinese multimodal LLM, and demonstrate impressive instruction-following and video understanding ability. The zero-shot instruction understanding experiment indicates that pre-training with Youku-mPLUG enhances the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.
