

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

February 29, 2024
作者: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov
cs.AI

Abstract

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
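The selection step described above can be sketched as follows: each cross-modality teacher proposes a caption for a clip, and a fine-tuned retrieval model scores clip-caption similarity, keeping the highest-scoring caption as the annotation. This is a minimal illustration only; the `toy_similarity` function below is a hypothetical token-overlap stand-in for the learned video-text retrieval model used in the paper, and the candidate captions are invented.

```python
def toy_similarity(clip_description: str, caption: str) -> float:
    # Stand-in for a learned clip-caption similarity score
    # (Jaccard overlap of lowercase tokens; NOT the paper's model).
    clip_tokens = set(clip_description.lower().split())
    cap_tokens = set(caption.lower().split())
    if not clip_tokens or not cap_tokens:
        return 0.0
    return len(clip_tokens & cap_tokens) / len(clip_tokens | cap_tokens)


def select_best_caption(clip_description, candidate_captions, score_fn=toy_similarity):
    # One candidate per teacher model; keep the best-scoring caption.
    return max(candidate_captions, key=lambda c: score_fn(clip_description, c))


candidates = [
    "a dog runs across a field",      # e.g. from a frame-based teacher
    "music plays in the background",  # e.g. from a subtitle-based teacher
]
best = select_best_caption("a dog running in a grassy field", candidates)
# best == "a dog runs across a field"
```

In the actual pipeline the scoring model is first fine-tuned on a small subset where the best caption per video was chosen manually, then applied across all 70M clips.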