InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
July 13, 2023
Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
cs.AI
Abstract
This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totaling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLMs), thereby
showcasing its efficacy in learning video-language representations at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Trained on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system and for advancing video-to-text and
text-to-video generation research. The proposed resources provide tools for
researchers and practitioners interested in multimodal video understanding and
generation.
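
The abstract only names the multi-scale approach for generating video descriptions with LLMs. One plausible reading of such a pipeline is a coarse-to-fine scheme: an image captioner describes a single representative frame (coarse scale), and an LLM fuses several frame-level captions into one clip-level description (fine scale). Below is a minimal Python sketch in that spirit; the `describe_clip` function and the `image_captioner`/`llm` interfaces are illustrative assumptions, not the paper's actual components.

```python
def describe_clip(frames, image_captioner, llm, num_fine_frames=4):
    """Illustrative two-scale video captioning (assumed interfaces).

    frames: list of decoded video frames (e.g. PIL images or arrays).
    image_captioner: callable mapping a single frame to a caption string.
    llm: callable mapping a text prompt to a completion string.
    """
    # Coarse scale: one caption for a representative (middle) frame.
    coarse_caption = image_captioner(frames[len(frames) // 2])

    # Fine scale: caption several evenly spaced frames...
    step = max(1, len(frames) // num_fine_frames)
    frame_captions = [image_captioner(f) for f in frames[::step]]

    # ...then ask an LLM to fuse them into one clip-level description.
    prompt = (
        "Combine these frame captions into a single coherent "
        "description of the video clip:\n" + "\n".join(frame_captions)
    )
    fine_caption = llm(prompt)
    return coarse_caption, fine_caption
```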
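
Likewise, the abstract names only contrastive learning as ViCLIP's training paradigm. The sketch below shows the standard CLIP-style symmetric InfoNCE objective commonly used for video-text pairs, in PyTorch; the embedding shapes, frame pooling, and temperature value are illustrative assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) tensors from a video encoder (e.g. a ViT-L
    backbone pooled over frames) and a text encoder. The shapes and the
    0.07 temperature are illustrative defaults, not values from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

In this setup, each video's matched caption serves as its positive while the other captions in the batch act as negatives, which is the usual arrangement in CLIP-family training.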