

E5-V: Universal Embeddings with Multimodal Large Language Models

July 17, 2024
Authors: Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang
cs.AI

Abstract

Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
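
The abstract describes two ingredients: prompt-based embeddings from an MLLM that place text and images in a shared space, and contrastive training on text pairs alone. The sketch below is a minimal illustration of both ideas, not the authors' released implementation; the model checkpoint, the prompt wording, the last-token pooling choice, and the InfoNCE loss are assumptions made for the example.

```python
# Minimal sketch of prompt-based multimodal embeddings with an off-the-shelf MLLM.
# Assumptions (not confirmed by the abstract): LLaVA-NeXT backbone, "summarize in
# one word" style prompts, and last-token pooling of the final hidden layer.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed backbone for illustration
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Hypothetical prompt templates; both modalities are steered toward the same
# "one word" target so their embeddings land in a shared space.
TEXT_PROMPT = "[INST] {text}\nSummarize the above sentence in one word: [/INST]"
IMAGE_PROMPT = "[INST] <image>\nSummarize the above image in one word: [/INST]"


@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    # Text-only input goes through the tokenizer; no pixel values are needed.
    inputs = processor.tokenizer(
        TEXT_PROMPT.format(text=text), return_tensors="pt"
    ).to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    # Last-token hidden state of the final layer as the embedding.
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)


@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=IMAGE_PROMPT, images=image, return_tensors="pt").to(
        model.device
    )
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor, temperature=0.05):
    # Symmetric contrastive loss over a batch of paired sentence embeddings,
    # illustrating the single-modality (text-pair-only) training signal.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


# Zero-shot image-text matching: cosine similarity in the shared embedding space.
caption_emb = embed_text("a dog playing in the snow")
image_emb = embed_image(Image.open("dog.jpg").convert("RGB"))
print((caption_emb @ image_emb.T).item())
```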

