Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
July 31, 2025
Authors: Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
cs.AI
Abstract
Text-Video Retrieval aims to find the most relevant text (or video) candidate
given a video (or text) query from large-scale online databases. Recent work
leverages multi-modal large language models (MLLMs) to improve retrieval,
especially for long or complex query-candidate pairs. However, we observe that
the naive application of MLLMs, i.e., retrieval based on candidate likelihood,
introduces candidate prior bias, favoring candidates with inherently higher
priors over those more relevant to the query. To this end, we propose a novel
retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM),
which leverages both query and candidate likelihoods by training the model to
generate text from a given video as well as video features from a given text.
Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet
effective training-free score calibration module designed to mitigate candidate
prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks,
our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4
R@1 on average, effectively alleviating candidate prior bias and emphasizing
query-candidate relevance. Our in-depth analysis across various multi-modal
tasks beyond retrieval highlights the broad applicability of CPN, which enhances
visual understanding by reducing reliance on textual priors. Code is available
at https://github.com/mlvlab/BLiM.
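
Below is a minimal, illustrative sketch of how the two likelihood directions and CPN could be combined into a retrieval score, based only on the abstract's description. The function name blim_score, the weighting parameter alpha, and the exact combination formula are assumptions for illustration and are not taken from the released BLiM implementation; the key idea shown is that CPN subtracts the candidate's unconditional log-prior from the candidate likelihood so that high-prior candidates are not ranked above more query-relevant ones.

```python
import numpy as np

def blim_score(log_p_text_given_video, log_p_video_given_text, log_p_text, alpha=0.5):
    """Score candidates with bidirectional likelihoods plus CPN (illustrative sketch).

    log_p_text_given_video: candidate likelihood log P(t | v) from the MLLM.
    log_p_video_given_text: query likelihood log P(v | t) from the MLLM.
    log_p_text: unconditional candidate prior log P(t), used by CPN.
    alpha: weight between the two directions (hypothetical parameter).
    """
    # CPN: subtract the candidate prior from the candidate likelihood so that
    # candidates with inherently high priors are not favored regardless of the query.
    calibrated_candidate = log_p_text_given_video - log_p_text
    # Combine the calibrated candidate likelihood with the query likelihood.
    return alpha * calibrated_candidate + (1.0 - alpha) * log_p_video_given_text

# Toy example: one video query, two text candidates.
# Candidate 0 is a generic caption with a high prior; candidate 1 is more
# relevant to the query but has a lower prior. The raw candidate likelihood
# alone would rank candidate 0 first; CPN corrects the ranking.
log_p_t_given_v = np.array([-2.0, -2.2])  # raw candidate likelihoods
log_p_v_given_t = np.array([-5.0, -3.5])  # query likelihoods
log_p_t = np.array([-1.0, -3.0])          # candidate priors

scores = blim_score(log_p_t_given_v, log_p_v_given_t, log_p_t)
print("ranking (best first):", np.argsort(-scores).tolist())
```

In this toy run the score of the query-relevant candidate exceeds that of the high-prior generic caption, which mirrors the bias-mitigation behavior the abstract attributes to CPN; the actual scoring rule and any direction weighting used in the paper may differ.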