Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
June 7, 2023
作者: Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang
cs.AI
Abstract
To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we release
the largest public high-quality Chinese video-language dataset to date, named
Youku-mPLUG. It is collected from Youku, a well-known Chinese video-sharing
website, under strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks, covering three popular video-language tasks: cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1%
improvement in video category classification. Moreover, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy
in video category classification and a CIDEr score of 68.9 in video
captioning. Finally, we scale up mPLUG-video into a Chinese multimodal LLM
based on frozen Bloomz, with only 1.7% of its parameters trainable, and it
demonstrates impressive instruction-following and video understanding
abilities. A zero-shot instruction understanding experiment indicates that
pre-training on Youku-mPLUG enhances the ability to comprehend overall and
detailed visual semantics, recognize scene text, and leverage open-domain
knowledge.
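To make the reported video category classification metric concrete, below is a minimal sketch of how top-1 accuracy is typically computed. The record layout, category names, and function name here are illustrative assumptions for this sketch, not the actual Youku-mPLUG schema or evaluation code.

```python
# Hypothetical sketch: top-1 accuracy for video category classification.
# Category names ("food", "travel", "sports") are illustrative examples,
# not the dataset's actual 45-category taxonomy.

def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring predicted category
    matches the ground-truth label."""
    assert len(predictions) == len(labels), "one prediction per label"
    correct = sum(
        1
        for scores, label in zip(predictions, labels)
        # max over category names, ranked by their predicted scores
        if max(scores, key=scores.get) == label
    )
    return correct / len(labels)

# Toy example: model scores over three hypothetical categories.
preds = [
    {"food": 0.7, "travel": 0.2, "sports": 0.1},  # top-1 is "food" (correct)
    {"food": 0.1, "travel": 0.3, "sports": 0.6},  # top-1 is "sports" (wrong)
]
labels = ["food", "travel"]
print(top1_accuracy(preds, labels))  # 0.5
```

Under this definition, the paper's 80.5% figure would mean the model's highest-scoring category matches the annotated one for roughly four out of five benchmark videos.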