VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
July 7, 2025
Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
cs.AI
Abstract
Multimodal embedding models have been crucial in enabling various downstream
tasks such as semantic similarity, information retrieval, and clustering over
different modalities. However, existing multimodal embedding models such as VLM2Vec,
E5-V, and GME focus predominantly on natural images, with limited support for
other visual forms such as videos and visual documents. This restricts their
applicability in real-world scenarios, including AI agents, multimodal search
and recommendation, and retrieval-augmented generation (RAG). To close this
gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across
diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark
that extends MMEB with five new task types: visual document retrieval, video
retrieval, temporal grounding, video classification, and video question
answering, spanning text, image, video, and visual document inputs. Next, we
train VLM2Vec-V2, a general-purpose embedding model that supports text, image,
video, and visual document inputs. Extensive experiments show that VLM2Vec-V2
not only achieves strong performance on the newly introduced video and document
retrieval tasks but also improves over prior baselines on the original image
benchmarks. Through this broad evaluation, our study offers insights into the
generalizability of various multimodal embedding models and highlights
effective strategies for unified embedding learning, laying the groundwork for
more scalable and adaptable representation learning in both research and
real-world settings.
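
To make concrete how a unified embedding model of this kind is typically used for cross-modal retrieval, the minimal Python sketch below encodes a text query and a mixed corpus of images, videos, and visual documents into a single vector space and ranks the corpus by cosine similarity. The `embed` function, the item schema, and the file paths are hypothetical placeholders for illustration only, not part of VLM2Vec-V2's actual API.

```python
# A minimal sketch of embedding-based cross-modal retrieval.
# `embed` is a hypothetical stand-in for a unified multimodal encoder
# (in the spirit of VLM2Vec-V2); it returns random unit vectors so the
# script runs end to end without the real model.
import numpy as np

rng = np.random.default_rng(0)

def embed(item: dict) -> np.ndarray:
    """Map a text / image / video / visual-document item to a unit vector.

    Placeholder: a real unified encoder would consume the raw content of
    `item`; this stand-in ignores it and returns a random unit vector so
    the retrieval loop below is runnable.
    """
    vec = rng.standard_normal(768)
    return vec / np.linalg.norm(vec)

# A toy corpus mixing visual forms; ids and paths are illustrative only.
corpus = [
    {"id": "doc_01", "type": "visual_document", "path": "slides/page_3.png"},
    {"id": "vid_07", "type": "video", "path": "clips/howto_cook.mp4"},
    {"id": "img_42", "type": "image", "path": "photos/cat.jpg"},
]
query = {"id": "q_0", "type": "text", "text": "a tutorial on cooking pasta"}

# Encode everything into the same embedding space.
corpus_matrix = np.stack([embed(item) for item in corpus])  # shape (N, d)
query_vec = embed(query)                                    # shape (d,)

# On unit vectors, cosine similarity reduces to a dot product.
scores = corpus_matrix @ query_vec
ranking = np.argsort(-scores)

for rank, idx in enumerate(ranking, start=1):
    item = corpus[idx]
    print(f"{rank}. {item['id']} ({item['type']}) score={scores[idx]:.3f}")
```

Because all modalities land in one shared space, the same dot-product ranking serves every task framed as retrieval (document retrieval, video retrieval, temporal grounding, and so on); only the inputs to the encoder change.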