
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

July 7, 2025
Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
cs.AI

Abstract

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multimodal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering, spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 not only achieves strong performance on the newly introduced video and document retrieval tasks but also improves over prior baselines on the original image benchmarks. Through this extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
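To make the retrieval setting described in the abstract concrete, the sketch below shows how a unified multimodal embedding model is typically used: every input, whether a text query, an image, a video, or a visual document page, is mapped into one shared vector space, and candidates are ranked by cosine similarity. This is a minimal sketch, not the actual VLM2Vec-V2 API; the `encode` function is a hypothetical placeholder that returns random unit vectors so the example runs end to end.

```python
import numpy as np

# Placeholder randomness source; a real encoder would be deterministic.
_rng = np.random.default_rng(0)

def encode(inputs: list[str]) -> np.ndarray:
    """Hypothetical stand-in for a unified multimodal encoder.

    A real model such as VLM2Vec-V2 would map text, images, videos,
    and visual document pages into one shared embedding space; here
    we return random unit vectors purely for illustration.
    """
    vecs = _rng.normal(size=(len(inputs), 768))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Candidates can mix visual forms: a video, a document page, an image.
candidates = ["lecture_clip.mp4", "invoice_page.png", "cat_photo.jpg"]
candidate_embs = encode(candidates)

# A text query lands in the same space, so cross-modal retrieval
# reduces to a cosine-similarity ranking (dot product of unit vectors).
query_emb = encode(["a video of a machine learning lecture"])[0]
scores = candidate_embs @ query_emb

for idx in np.argsort(-scores):  # highest similarity first
    print(f"{scores[idx]:+.3f}  {candidates[idx]}")
```

Because all modalities share one embedding space, the same ranking loop serves every MMEB-V2 retrieval task; only the inputs fed to the encoder change.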