VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
July 7, 2025
Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
cs.AI
Abstract
Multimodal embedding models have been crucial in enabling various downstream
tasks such as semantic similarity, information retrieval, and clustering over
different modalities. However, existing multimodal embedding models such as VLM2Vec,
E5-V, and GME focus predominantly on natural images, with limited support for
other visual forms such as videos and visual documents. This restricts their
applicability in real-world scenarios, including AI agents, multimodal search
and recommendation, and retrieval-augmented generation (RAG). To close this
gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across
diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark
that extends MMEB with five new task types: visual document retrieval, video
retrieval, temporal grounding, video classification, and video question
answering, spanning text, image, video, and visual document inputs. Next, we
train VLM2Vec-V2, a general-purpose embedding model that supports text, image,
video, and visual document inputs. Extensive experiments show that VLM2Vec-V2
not only achieves strong performance on the newly introduced video and document
retrieval tasks but also improves over prior baselines on the original image
benchmarks. Through this broad evaluation, our study offers insights into the
generalizability of various multimodal embedding models and highlights
effective strategies for unified embedding learning, laying the groundwork for
more scalable and adaptable representation learning in both research and
real-world settings.
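
To make concrete how a unified embedding model of this kind is typically used for cross-modal retrieval, the minimal Python sketch below encodes a text query and a mixed corpus of images, videos, and visual documents into a single vector space and ranks the corpus by cosine similarity. The `embed` function, the item schema, and the file paths are hypothetical placeholders for illustration only, not part of VLM2Vec-V2's actual API.

```python
# A minimal sketch of embedding-based cross-modal retrieval.
# `embed` is a hypothetical stand-in for a unified multimodal encoder
# (in the spirit of VLM2Vec-V2); it returns random unit vectors so the
# script runs end to end without the real model.
import numpy as np

rng = np.random.default_rng(0)

def embed(item: dict) -> np.ndarray:
    """Map a text / image / video / visual-document item to a unit vector.

    Placeholder: a real unified encoder would consume the raw content of
    `item`; this stand-in ignores it and returns a random unit vector so
    the retrieval loop below is runnable.
    """
    vec = rng.standard_normal(768)
    return vec / np.linalg.norm(vec)

# A toy corpus mixing visual forms; ids and paths are illustrative only.
corpus = [
    {"id": "doc_01", "type": "visual_document", "path": "slides/page_3.png"},
    {"id": "vid_07", "type": "video", "path": "clips/howto_cook.mp4"},
    {"id": "img_42", "type": "image", "path": "photos/cat.jpg"},
]
query = {"id": "q_0", "type": "text", "text": "a tutorial on cooking pasta"}

# Encode everything into the same embedding space.
corpus_matrix = np.stack([embed(item) for item in corpus])  # shape (N, d)
query_vec = embed(query)                                    # shape (d,)

# On unit vectors, cosine similarity reduces to a dot product.
scores = corpus_matrix @ query_vec
ranking = np.argsort(-scores)

for rank, idx in enumerate(ranking, start=1):
    item = corpus[idx]
    print(f"{rank}. {item['id']} ({item['type']}) score={scores[idx]:.3f}")
```

Because all modalities land in one shared space, the same dot-product ranking serves every task framed as retrieval (document retrieval, video retrieval, temporal grounding, and so on); only the inputs to the encoder change.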