Distilling Vision-Language Models on Millions of Videos

Abstract

Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there is simply not enough human-curated video-text data available. We therefore fine-tune a video-language model from a strong image-language baseline using synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos with high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, and these descriptions provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
