Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

November 3, 2025
Authors: Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
cs.AI

Abstract

In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. Typical fusion techniques are training-free, but they rely solely on rank or score signals and disregard the candidates' content representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that recasts list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, enabling list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines such as CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, gains of up to +40 Recall@1 points over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
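
To make the recipe concrete, below is a minimal Python sketch (not the authors' implementation; the real code is in the linked repository) of the three ingredients the abstract names: the CombSUM score-fusion baseline, an S-Grid that tiles sampled frames into a single image, and a list-wise prompt that serializes per-retriever ranks alongside content evidence. The Candidate type, grid dimensions, and prompt wording here are illustrative assumptions.

# Hedged sketch: Candidate, the grid layout, and the prompt wording are
# illustrative assumptions, not the paper's exact interface.
from dataclasses import dataclass
from typing import Dict, List
from PIL import Image

@dataclass
class Candidate:
    video_id: str
    frames: List[Image.Image]   # uniformly sampled video frames
    subtitle: str               # subtitle text, "" if unavailable

def comb_sum(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """CombSUM baseline: sum each candidate's min-max-normalized
    scores across all retriever runs."""
    fused: Dict[str, float] = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for vid, score in run.items():
            fused[vid] = fused.get(vid, 0.0) + (score - lo) / span
    return fused

def s_grid(frames: List[Image.Image], cols: int = 3, cell: int = 224) -> Image.Image:
    """Tile sampled frames into one image grid (the S-Grid), so a
    single image serializes a whole video candidate."""
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * cell, rows * cell))
    for i, frame in enumerate(frames):
        grid.paste(frame.resize((cell, cell)),
                   ((i % cols) * cell, (i // cols) * cell))
    return grid

def vic_prompt(query: str, candidates: List[Candidate],
               ranks: Dict[str, List[int]]) -> list:
    """Serialize retriever metadata (per-retriever ranks) together with
    content evidence (S-Grid + subtitles) into one list-wise prompt."""
    parts: list = [f"Query: {query}\n"
                   "Rerank the candidates below; weigh retriever consensus "
                   "against visual content. Return video ids, best first."]
    for c in candidates:
        parts.append(f"[{c.video_id}] ranks per retriever: {ranks[c.video_id]}; "
                     f"subtitles: {c.subtitle or 'n/a'}")
        parts.append(s_grid(c.frames))  # interleave the image evidence
    return parts

The returned list of interleaved text and images would then be handed to any multimodal VLM chat API; the ordered list of video ids in the model's answer is the fused ranking.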