Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
June 7, 2023
Authors: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang
cs.AI
Abstract
To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we
release the largest public high-quality Chinese video-language dataset, named
Youku-mPLUG. It is collected from Youku, a well-known Chinese video-sharing
website, under strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate a comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks covering three popular video-language tasks of cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1%
improvement in video category classification. In addition, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in
video category classification and a 68.9 CIDEr score in video captioning.
Finally, we scale up mPLUG-video based on the frozen Bloomz with
only 1.7% trainable parameters as a Chinese multimodal LLM, which demonstrates
impressive instruction-following and video understanding abilities. The zero-shot
instruction understanding experiment indicates that pre-training with
Youku-mPLUG can enhance the ability to comprehend overall and detailed visual
semantics, recognize scene text, and leverage open-domain knowledge.