

DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

June 10, 2025
作者: Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding
cs.AI

Abstract

The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
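The Image-to-Video Alignment Distillation described above transfers image-level alignment knowledge into the video-level alignment. A minimal, illustrative sketch of such a distillation objective is shown below; the KL-divergence form, the temperature value, and the function names are our assumptions for exposition, not the paper's exact loss:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_distillation_loss(image_sim, video_sim, tau=0.07):
    """KL divergence from the image-level alignment (teacher) to the
    video-level alignment (student).

    Both inputs are [num_texts, num_candidates] similarity matrices;
    each row is turned into a retrieval distribution over candidates.
    """
    p = softmax(image_sim / tau)  # teacher distribution per text query
    q = softmax(video_sim / tau)  # student distribution per text query
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy example: 4 text queries against 4 video candidates.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 4))
loss_mismatch = alignment_distillation_loss(teacher, rng.normal(size=(4, 4)))
loss_match = alignment_distillation_loss(teacher, teacher)
# When the student reproduces the teacher's alignment, the loss vanishes.
```

Minimizing this term pulls the video-level similarity distribution toward the finer-grained image-level one, which is the intuition behind using image-level alignment (here learned from pseudo image captions) as a teacher signal.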