Distilling Vision-Language Models on Millions of Videos
January 11, 2024
Authors: Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
cs.AI
Abstract
The recent advance in vision-language models is largely attributed to the
abundance of image-text data. We aim to replicate this success for
video-language models, but there simply is not enough human-curated video-text
data available. We thus resort to fine-tuning a video-language model from a
strong image-language baseline with synthesized instructional data. The
resulting video-language model is then used to auto-label millions of videos to
generate high-quality captions. We show the adapted video-language model
performs well on a wide range of video-language benchmarks. For instance, it
surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our
model generates detailed descriptions for previously unseen videos, providing
better textual supervision than existing methods. Experiments show that
a video-language dual-encoder model contrastively trained on these
auto-generated captions is 3.8% better than the strongest baseline that also
leverages vision-language models. Our best model outperforms state-of-the-art
methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
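
A minimal sketch of the last stage described above: contrastively training a video-text dual encoder on the auto-generated captions with a CLIP-style symmetric InfoNCE objective. This is not the authors' code; the `video_encoder`, `text_encoder`, temperature value, and training-step structure are illustrative assumptions.

```python
# Sketch (assumed, not the paper's implementation): symmetric contrastive
# training of a video-text dual encoder on VLM-generated pseudo-captions.
import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, pseudo-caption) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)            # video -> caption
    loss_t2v = F.cross_entropy(logits.t(), targets)        # caption -> video
    return 0.5 * (loss_v2t + loss_t2v)


def training_step(video_encoder, text_encoder, optimizer, videos, pseudo_captions):
    # `video_encoder` / `text_encoder` stand in for any dual-encoder backbone;
    # `pseudo_captions` are the captions auto-generated by the adapted VLM.
    video_emb = video_encoder(videos)           # (B, D)
    text_emb = text_encoder(pseudo_captions)    # (B, D)
    loss = symmetric_contrastive_loss(video_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric loss pulls each video toward its own generated caption and pushes it away from the other captions in the batch, which is the standard recipe for zero-shot text-to-video retrieval benchmarks such as MSR-VTT.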