Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Abstract

Text-Video Retrieval aims to find the most relevant text (or video) candidate for a given video (or text) query in large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias: it favors candidates with inherently higher priors over those more relevant to the query. To address this, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN, which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
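The effect of candidate prior bias can be illustrated with a toy example. The sketch below is not taken from the BLiM code. It assumes that calibration subtracts a weighted log prior from the candidate log-likelihood, which is one plausible reading of "Candidate Prior Normalization". The function name `cpn_score`, the weight `alpha`, and all probability values are hypothetical.

```python
import math


def cpn_score(log_p_cand_given_query, log_p_cand_prior, alpha=1.0):
    """Candidate log-likelihood with the candidate's log prior discounted.

    Assumed form (not confirmed by the abstract):
        log P(c | q) - alpha * log P(c)
    """
    return log_p_cand_given_query - alpha * log_p_cand_prior


def rank(scores):
    """Return candidate names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical values for one video query and two candidate captions.
# "generic" is a short, common caption that the model finds likely
# regardless of the query; "specific" actually describes the video.
log_p_c_given_q = {"generic": math.log(0.30), "specific": math.log(0.20)}
log_p_c = {"generic": math.log(0.25), "specific": math.log(0.02)}

# Ranking by raw candidate likelihood favors the high-prior caption.
naive = rank(log_p_c_given_q)

# Discounting the prior favors the caption the query actually supports.
calibrated = rank(
    {c: cpn_score(log_p_c_given_q[c], log_p_c[c]) for c in log_p_c}
)

print(naive)
print(calibrated)
```

In this toy setting the generic caption wins on raw likelihood only because it is likely under any query. Once its prior is discounted, the specific caption ranks first, which is the behavior the abstract attributes to CPN.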
