
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

June 10, 2025
Authors: Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding
cs.AI

Abstract

The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
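To make the two mechanisms named in the abstract more concrete, below is a minimal PyTorch sketch of image-video feature fusion and image-to-video alignment distillation. This is an illustration under assumptions, not the authors' implementation: the mean pooling, the blending weight `alpha`, the temperature `tau`, and the choice of a KL-divergence distillation objective are all hypothetical, and the function names are invented for this sketch.

```python
# Hypothetical sketch of the two ideas named in the abstract (not the
# authors' code from the linked repository):
#   1) fusing image-level (per-frame) and video-level features;
#   2) distilling image-level alignment scores into video-level alignment.
import torch
import torch.nn.functional as F

def fuse_image_video_features(frame_feats, video_feat, alpha=0.5):
    """Blend per-frame (image-level) features with a video-level feature.
    frame_feats: (B, T, D) frame embeddings; video_feat: (B, D)."""
    image_level = frame_feats.mean(dim=1)  # pool T frames into one image-level summary
    return alpha * image_level + (1 - alpha) * video_feat

def alignment_distillation_loss(sim_image, sim_video, tau=0.01):
    """Distill image-level alignment into video-level alignment via KL.
    sim_image / sim_video: (B, B) similarity matrices between B videos and
    B texts; sim_image would come from frame features matched against
    pseudo image captions, and acts as the (detached) teacher."""
    teacher = F.softmax(sim_image.detach() / tau, dim=-1)
    student = F.log_softmax(sim_video / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy usage with random, L2-normalized embeddings:
B, T, D = 4, 12, 512
frame_feats = F.normalize(torch.randn(B, T, D), dim=-1)
video_feat = F.normalize(torch.randn(B, D), dim=-1)
text_feat = F.normalize(torch.randn(B, D), dim=-1)

fused = fuse_image_video_features(frame_feats, video_feat)
sim_image = frame_feats.mean(dim=1) @ text_feat.T  # stand-in for image-level scores
sim_video = fused @ text_feat.T
loss = alignment_distillation_loss(sim_image, sim_video)
```

The design intuition, as the abstract states it, is that CLIP's pretrained alignment is strongest at the image level, so the image-level similarity matrix serves as a teacher signal that regularizes the video-level similarities learned during parameter-efficient adaptation.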