VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

February 8, 2026
Authors: Issar Tzachor, Dvir Samuel, Rami Ben-Ari
cs.AI

Abstract

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy that maps dense video captions to short summaries, enabling task-relevant video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current approaches, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
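To make the intermediate-layer idea concrete, below is a minimal sketch (not the authors' code): it mean-pools hidden states from one intermediate transformer layer and ranks candidates by cosine similarity. The `gpt2` backbone, the layer index, and the mean-pooling scheme are illustrative assumptions only; the paper's actual pipeline feeds video frames through a video MLLM and further combines these embeddings with a calibrated MLLM head.

```python
# Illustrative sketch, NOT the authors' implementation: retrieve with an
# intermediate-layer embedding instead of the final layer. "gpt2" is a
# text-only stand-in for a video MLLM; LAYER and mean pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"   # placeholder backbone, not from the paper
LAYER = 6        # hypothetical intermediate layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

@torch.no_grad()
def embed(texts, layer=LAYER):
    """Mean-pool token states from one intermediate layer, then L2-normalize."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**batch).hidden_states[layer]   # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # zero out padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

# Toy retrieval: captions stand in for the (video, text) pairs of the paper.
queries = ["a dog catches a frisbee in the park"]
gallery = ["a dog playing frisbee outdoors", "a chef chops onions in a kitchen"]
sims = embed(queries) @ embed(gallery).T           # cosine similarity matrix
print(sims.argmax(dim=-1))                          # best match per query
```

The same pooled-embedding interface would also carry the paper's text-only alignment step, in which dense captions and their short summaries form training pairs without any visual supervision.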