

Distilling Vision-Language Models on Millions of Videos

January 11, 2024
Authors: Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
cs.AI

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
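The abstract's final step trains a video-language dual-encoder contrastively on the auto-generated captions. Below is a minimal sketch of the standard symmetric InfoNCE objective such dual-encoder training typically uses, written in PyTorch; it is not the authors' implementation, and the function name, batch size, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dual_encoder_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (video, caption) embedding pairs.

    video_emb, text_emb: (B, D) outputs of the video and text encoders,
    where row i of each tensor comes from the same video-caption pair.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matching pairs.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_v2t = F.cross_entropy(logits, targets)      # video -> caption
    loss_t2v = F.cross_entropy(logits.t(), targets)  # caption -> video
    return (loss_v2t + loss_t2v) / 2


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    B, D = 8, 512
    loss = dual_encoder_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

The same loss supports zero-shot text-to-video retrieval at inference time: captions and videos are embedded once, and retrieval reduces to ranking the cosine similarities in the matrix above.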