

Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

November 3, 2025
Authors: Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
cs.AI

Abstract

In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding the candidates' content representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that reframes list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines such as CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (text-to-video) / 89.0% (video-to-text) on MSR-VTT and 99.6% (video-to-text) on VATEX, a gain of up to +40 Recall@1 points over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
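To make the abstract's mechanism concrete, the following is a minimal Python sketch of the three ideas it names: tiling sampled frames into a per-video grid (the S-Grid), serializing candidates with their retriever ranks into a single VLM prompt, and the CombSUM score-summing baseline that ViC is compared against. This is not the authors' implementation; every name here (build_s_grid, serialize_candidates, query_vlm), the 3-column/256-pixel grid layout, the candidate dictionary format, and the min-max normalization in CombSUM are illustrative assumptions, and the paper's actual serialization and prompt format may differ.

```python
# Illustrative sketch of the ideas described in the abstract -- not the
# authors' code. All names and parameters below are assumptions.
from PIL import Image


def build_s_grid(frame_paths, cols=3, tile=256):
    """Tile uniformly sampled video frames into one grid image (an S-Grid)."""
    rows = -(-len(frame_paths) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * tile, rows * tile))
    for i, path in enumerate(frame_paths):
        frame = Image.open(path).convert("RGB").resize((tile, tile))
        grid.paste(frame, ((i % cols) * tile, (i // cols) * tile))
    return grid


def serialize_candidates(query, candidates):
    """Serialize content evidence plus retriever metadata into one prompt.

    `candidates` maps a candidate id to {"ranks": {retriever: position},
    "subs": "optional subtitles"} -- a hypothetical layout for illustration.
    """
    lines = [f"Query: {query}", "Candidates:"]
    for cid, meta in candidates.items():
        ranks = ", ".join(f"{name}=#{pos}" for name, pos in meta["ranks"].items())
        lines.append(f"[{cid}] retriever ranks: {ranks}; subtitles: {meta['subs']}")
    lines.append("Rank all candidates from best to worst match for the query.")
    return "\n".join(lines)


def comb_sum(score_lists):
    """CombSUM baseline: sum each candidate's min-max normalized scores.

    Rank/score-only fusion like this never inspects the videos themselves,
    which is the gap ViC's content-aware voting is meant to close.
    """
    fused = {}
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        for cid, s in scores.items():
            fused[cid] = fused.get(cid, 0.0) + ((s - lo) / (hi - lo) if hi > lo else 0.0)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical usage, with `query_vlm` standing in for any VLM client:
#   grids = [build_s_grid(paths) for paths in frames_per_candidate]
#   prompt = serialize_candidates("a dog catching a frisbee", candidates)
#   verdict = query_vlm(images=grids, prompt=prompt)
```

In this sketch the VLM sees both the retriever consensus (the serialized rank metadata) and the content evidence (the S-Grid images and subtitles) in one context, whereas comb_sum, like other score-level fusers, only ever aggregates numbers.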