InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
March 22, 2024
Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
cs.AI
Abstract
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies different self- or weakly-supervised learning frameworks: masked video token reconstruction, cross-modal contrastive learning, and next-token prediction. The different training stages guide our model to capture different levels of structural and semantic information through different pretext tasks. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. We scale both the data and the model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long-video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
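To make the cross-modal contrastive learning objective mentioned in the abstract concrete, the following is a minimal, illustrative PyTorch-style sketch of a symmetric video-text InfoNCE loss. It is not the authors' implementation; the function and parameter names (contrastive_loss, temperature) and the choice of a simple in-batch objective are assumptions for illustration only.

```python
# Illustrative sketch only (not the authors' code): a symmetric in-batch
# video-text contrastive (InfoNCE-style) objective of the kind named
# in the abstract. All names here are hypothetical.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings."""
    # Normalize so that dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by the temperature.
    logits = v @ t.T / temperature
    # Matched video-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)   # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets) # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```

In practice, such a loss is typically computed on embeddings produced by separate video and text encoders over large batches; the masked video token reconstruction and next-token prediction stages described in the abstract use different objectives and are not shown here.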