Distilling Vision-Language Models on Millions of Videos

Abstract

Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there is simply not enough human-curated video-text data available. We therefore fine-tune a video-language model from a strong image-language baseline using synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos with high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks; for instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
