InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

March 22, 2024
Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
cs.AI

Abstract

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies different self- or weakly-supervised learning frameworks: masked video token reconstruction, cross-modal contrastive learning, and next token prediction. The different training stages guide the model to capture different levels of structural and semantic information through different pretext tasks. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. We scale both data and model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
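The abstract only names the three pretext tasks that the progressive training paradigm unifies; it gives no implementation details. The sketch below is a minimal, hypothetical illustration in PyTorch-style Python of what those three objectives (masked video token reconstruction, cross-modal contrastive alignment, and next-token prediction) typically look like as loss functions. All module names, shapes, and loss forms are assumptions made for illustration and are not taken from the released InternVideo2 code.

```python
# Hypothetical sketch of the three pretext objectives named in the abstract.
# Shapes, dimensions, and loss forms are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoEncoder(nn.Module):
    """Stand-in for a video backbone; purely illustrative."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    def forward(self, video_tokens):            # (B, T, in_dim) patch tokens
        return self.proj(video_tokens)          # (B, T, dim) features

def stage1_masked_reconstruction(encoder, decoder, video_tokens, mask):
    """Stage 1 (assumed form): reconstruct masked video tokens, MAE-style."""
    visible = video_tokens * (~mask).unsqueeze(-1)       # zero out masked tokens
    recon = decoder(encoder(visible))                    # (B, T, in_dim)
    return F.mse_loss(recon[mask], video_tokens[mask])   # loss on masked positions only

def stage2_contrastive_alignment(video_emb, text_emb, temperature=0.07):
    """Stage 2 (assumed form): symmetric InfoNCE between video and text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def stage3_next_token_prediction(lm_logits, caption_ids):
    """Stage 3 (assumed form): next-token prediction on video-conditioned captions."""
    return F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                           caption_ids[:, 1:].reshape(-1))

if __name__ == "__main__":
    B, T, D = 2, 16, 768
    encoder, decoder = ToyVideoEncoder(D, 256), nn.Linear(256, D)
    video = torch.randn(B, T, D)
    mask = torch.rand(B, T) < 0.5                        # random token mask
    print("stage-1 loss:", stage1_masked_reconstruction(encoder, decoder, video, mask).item())
    print("stage-2 loss:", stage2_contrastive_alignment(torch.randn(B, 256), torch.randn(B, 256)).item())
    print("stage-3 loss:", stage3_next_token_prediction(torch.randn(B, 8, 1000),
                                                        torch.randint(0, 1000, (B, 8))).item())
```

In a progressive schedule of the kind the abstract describes, these objectives would be applied in successive stages rather than jointly, with each stage initializing from the previous one; the staging details here are an assumption, not the paper's recipe.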
