DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Abstract

The parameter-efficient adaptation of the image-text pretrained model CLIP for video-text retrieval is a prominent area of research. While CLIP focuses on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image level to video level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both the vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate the alignment discrepancy, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
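For readers unfamiliar with the R@1 figure quoted in the abstract, the following is a minimal sketch of how text-to-video Recall@K is commonly computed from a query-by-video similarity matrix. It is not taken from the DiscoVLA code base: the function name `recall_at_k`, the toy matrix `toy_sim`, and the assumption that video i is the single correct match for text query i are all illustrative.

```python
def recall_at_k(sim, k=1):
    """Text-to-video Recall@K, in percent.

    sim[i][j] is the similarity between text query i and video j.
    Assumption for this sketch: video i is the only correct match for query i.
    """
    if not sim:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly
        # higher. Ties are therefore resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example (hypothetical scores): 4 text queries against 4 videos.
toy_sim = [
    [0.9, 0.1, 0.2, 0.0],  # query 0: correct video ranked first
    [0.3, 0.2, 0.8, 0.1],  # query 1: correct video ranked third
    [0.1, 0.2, 0.7, 0.3],  # query 2: correct video ranked first
    [0.6, 0.1, 0.2, 0.5],  # query 3: correct video ranked second
]

print(recall_at_k(toy_sim, k=1))
print(recall_at_k(toy_sim, k=2))
```

Under this convention, an R@1 of 50.5% would mean that for roughly half of the test queries the correct video receives the highest similarity score among all candidates.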
