Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
July 31, 2025
Authors: Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
cs.AI
Abstract
Text-Video Retrieval aims to find the most relevant text (or video) candidate
given a video (or text) query from large-scale online databases. Recent work
leverages multi-modal large language models (MLLMs) to improve retrieval,
especially for long or complex query-candidate pairs. However, we observe that
the naive application of MLLMs, i.e., retrieval based on candidate likelihood,
introduces candidate prior bias, favoring candidates with inherently higher
priors over those more relevant to the query. To address this, we propose a novel
retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM),
which leverages both query and candidate likelihoods by training the model to
generate text from a given video as well as video features from a given text.
Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet
effective training-free score calibration module designed to mitigate candidate
prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks,
our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4
R@1 on average, effectively alleviating candidate prior bias and emphasizing
query-candidate relevance. Our in-depth analysis across various multi-modal
tasks beyond retrieval highlights the broad applicability of CPN, which enhances
visual understanding by reducing reliance on textual priors. Code is available
at https://github.com/mlvlab/BLiM.
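
The abstract does not spell out the scoring formula, so the sketch below illustrates one plausible reading of the two ideas: the candidate log-likelihood is debiased by subtracting the candidate's unconditional log-prior (a common calibration form, assumed here for CPN), and the result is combined with the query likelihood for bidirectional estimation. The function and parameter names (`score_candidates`, `alpha`, `beta`) are hypothetical; the actual formulation is given in the paper and the linked repository.

```python
# Minimal sketch of bidirectional likelihood scoring with candidate prior
# normalization (CPN). The exact normalization and weighting are assumptions
# for illustration, not the paper's verified formulation.
import torch

def score_candidates(
    log_p_cand_given_query: torch.Tensor,  # log p(candidate | query), shape (N,)
    log_p_query_given_cand: torch.Tensor,  # log p(query | candidate), shape (N,)
    log_p_cand_prior: torch.Tensor,        # unconditional log p(candidate), shape (N,)
    alpha: float = 0.5,                    # hypothetical weight between the two directions
    beta: float = 1.0,                     # hypothetical strength of prior normalization
) -> torch.Tensor:
    """Combine query and candidate likelihoods while debiasing the candidate term."""
    # CPN (assumed form): subtract the candidate's prior so that candidates with
    # inherently high likelihood are not favored regardless of the query.
    calibrated_cand_ll = log_p_cand_given_query - beta * log_p_cand_prior
    # Bidirectional score: mix the calibrated candidate likelihood with the
    # query likelihood p(query | candidate), which carries no candidate prior bias.
    return alpha * calibrated_cand_ll + (1.0 - alpha) * log_p_query_given_cand

# Toy usage: three text candidates for one video query.
scores = score_candidates(
    log_p_cand_given_query=torch.tensor([-12.3, -10.1, -11.8]),
    log_p_query_given_cand=torch.tensor([-15.0, -16.2, -13.9]),
    log_p_cand_prior=torch.tensor([-8.0, -5.5, -9.1]),
)
print(scores.argmax().item())  # index of the retrieved candidate
```

In this toy run, the candidate with the highest raw likelihood is not necessarily retrieved once its prior is removed and the query likelihood is taken into account, which is the behavior the abstract attributes to BLiM with CPN.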