Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
June 7, 2023
作者: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang
cs.AI
Abstract
To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we release
the largest public high-quality Chinese video-language dataset to date, named
Youku-mPLUG. It is collected from Youku, a well-known Chinese video-sharing
website, under strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks, covering three popular video-language tasks: cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1%
improvement in video category classification. Moreover, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy
in video category classification and a CIDEr score of 68.9 in video
captioning. Finally, we scale up mPLUG-video into a Chinese multimodal LLM
based on frozen Bloomz, with only 1.7% of its parameters trainable, and it
demonstrates impressive instruction-following and video understanding
abilities. A zero-shot instruction understanding experiment indicates that
pre-training on Youku-mPLUG enhances the ability to comprehend overall and
detailed visual semantics, recognize scene text, and leverage open-domain
knowledge.
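To make the reported video category classification metric concrete, below is a minimal sketch of how top-1 accuracy is typically computed. The record layout, category names, and function name here are illustrative assumptions for this sketch, not the actual Youku-mPLUG schema or evaluation code.

```python
# Hypothetical sketch: top-1 accuracy for video category classification.
# Category names ("food", "travel", "sports") are illustrative examples,
# not the dataset's actual 45-category taxonomy.

def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring predicted category
    matches the ground-truth label."""
    assert len(predictions) == len(labels), "one prediction per label"
    correct = sum(
        1
        for scores, label in zip(predictions, labels)
        # max over category names, ranked by their predicted scores
        if max(scores, key=scores.get) == label
    )
    return correct / len(labels)

# Toy example: model scores over three hypothetical categories.
preds = [
    {"food": 0.7, "travel": 0.2, "sports": 0.1},  # top-1 is "food" (correct)
    {"food": 0.1, "travel": 0.3, "sports": 0.6},  # top-1 is "sports" (wrong)
]
labels = ["food", "travel"]
print(top1_accuracy(preds, labels))  # 0.5
```

Under this definition, the paper's 80.5% figure would mean the model's highest-scoring category matches the annotated one for roughly four out of five benchmark videos.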