Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
February 29, 2024
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov
cs.AI
Abstract
The quality of the data and annotation upper-bounds the quality of a
downstream model. While there exist large text corpora and image-text pairs,
high-quality video-text data is much harder to collect. First of all, manual
labeling is more time-consuming, as it requires an annotator to watch an entire
video. Second, videos have a temporal dimension: they consist of several scenes
stacked together and show multiple actions. Accordingly, to establish a
video dataset with high-quality captions, we propose an automatic approach
leveraging multimodal inputs, such as textual video description, subtitles, and
individual video frames. Specifically, we curate 3.8M high-resolution videos
from the publicly available HD-VILA-100M dataset. We then split them into
semantically consistent video clips, and apply multiple cross-modality teacher
models to obtain captions for each video clip. Next, we finetune a retrieval model
on a small subset where the best caption of each video is manually selected and
then employ the model on the whole dataset to select the best caption as the
annotation. In this way, we get 70M videos paired with high-quality text
captions. We dub the dataset Panda-70M. We show the value of the proposed
dataset on three downstream tasks: video captioning, video and text retrieval,
and text-driven video generation. The models trained on the proposed data score
substantially better on the majority of metrics across all the tasks.
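To make the caption-selection step concrete, below is a minimal Python sketch of the pipeline the abstract describes: several cross-modality teachers each propose a caption for a clip, and a retrieval-style relevance score picks the best one. The Clip class, the teacher callables, and the relevance function are hypothetical stand-ins, not the actual Panda-70M models or their interfaces.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    """A semantically consistent video clip with its auxiliary text inputs (placeholder types)."""
    frames: list       # decoded video frames
    description: str   # textual video description, if available
    subtitles: str     # subtitles, if available


def select_caption(
    clip: Clip,
    teachers: List[Callable[[Clip], str]],
    relevance: Callable[[Clip, str], float],
) -> str:
    """Caption a clip with several teacher models and keep the caption
    that the relevance (retrieval) model scores highest for this clip."""
    candidates = [teacher(clip) for teacher in teachers]
    return max(candidates, key=lambda caption: relevance(clip, caption))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; in the paper's pipeline the
    # teachers are cross-modality captioning models and the relevance function
    # is a video-text retrieval model finetuned on manually selected best captions.
    clip = Clip(frames=[], description="a panda eats bamboo", subtitles="")
    teachers = [
        lambda c: c.description or "a video clip",
        lambda c: "a panda sitting in a forest eating bamboo",
    ]
    relevance = lambda c, caption: float(
        len(set(caption.split()) & set(c.description.split()))
    )
    print(select_caption(clip, teachers, relevance))

The sketch only illustrates the control flow (generate candidates per clip, then rank them); the quality of the resulting annotation in practice comes from the strength of the teacher models and the finetuned retrieval model, which are outside the scope of this example.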