
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

July 13, 2023
Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
cs.AI

Abstract

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), showcasing its efficacy for learning video-language representations at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system and for advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
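As a rough illustration of the contrastive training mentioned above, the sketch below computes a symmetric CLIP-style (InfoNCE) loss between pooled video and text embeddings. This is a minimal sketch under generic assumptions, not the ViCLIP implementation; the class name `VideoTextContrastive` and the `video_encoder`/`text_encoder` modules are hypothetical placeholders for encoders that return one feature vector per sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoTextContrastive(nn.Module):
    """Minimal CLIP-style contrastive head for paired video/text features.

    Illustrative sketch only: assumes `video_encoder` maps a batch of clips to
    (B, D) features and `text_encoder` maps tokenized captions to (B, D) features.
    """

    def __init__(self, video_encoder: nn.Module, text_encoder: nn.Module, init_temp: float = 0.07):
        super().__init__()
        self.video_encoder = video_encoder
        self.text_encoder = text_encoder
        # Learnable log-temperature, as in CLIP-style training.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / init_temp)))

    def forward(self, clips: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.video_encoder(clips), dim=-1)     # (B, D) video embeddings
        t = F.normalize(self.text_encoder(captions), dim=-1)   # (B, D) text embeddings
        logits = self.logit_scale.exp() * v @ t.t()            # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)     # matched pairs lie on the diagonal
        # Symmetric InfoNCE: video-to-text plus text-to-video cross-entropy.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In this formulation each video clip is pulled toward its own caption and pushed away from the other captions in the batch (and vice versa), which is the standard dual-encoder objective behind zero-shot recognition and retrieval.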