Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
June 7, 2023
Authors: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang
cs.AI
Abstract
To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we
release the largest public high-quality Chinese video-language dataset, named
Youku-mPLUG. It is collected from Youku, a well-known Chinese video-sharing
website, under strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate a comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks covering three popular video-language tasks of cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1%
improvement in video category classification. In addition, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in
video category classification and a 68.9 CIDEr score in video captioning.
Finally, we scale up mPLUG-video based on the frozen Bloomz with
only 1.7% trainable parameters as a Chinese multimodal LLM, which demonstrates
impressive instruction-following and video understanding abilities. The zero-shot
instruction understanding experiment indicates that pre-training with
Youku-mPLUG can enhance the ability to comprehend overall and detailed visual
semantics, recognize scene text, and leverage open-domain knowledge.