

E5-V: Universal Embeddings with Multimodal Large Language Models

July 17, 2024
Authors: Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang
cs.AI

Abstract

Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
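
The core idea is that a single MLLM, given prompts that ask it to compress either a sentence or an image into one word, can produce embeddings for both modalities in a shared space. The sketch below illustrates this prompt-based embedding extraction with a LLaVA-style model from Hugging Face Transformers; the model name, exact prompt wording, and last-token pooling are assumptions for illustration, not necessarily the authors' exact configuration or released checkpoint.

```python
# Minimal sketch of prompt-based multimodal embedding extraction, in the spirit of E5-V.
# Assumptions: the base MLLM, prompt templates, and last-token pooling are illustrative choices.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed base MLLM for illustration
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompts asking the model to summarize either modality in one word, so that
# text and image inputs are projected toward the same embedding space.
TEXT_PROMPT = "[INST] <sent>\nSummary of the above sentence in one word: [/INST]"
IMAGE_PROMPT = "[INST] <image>\nSummary of the above image in one word: [/INST]"

@torch.no_grad()
def embed_text(sentence: str) -> torch.Tensor:
    prompt = TEXT_PROMPT.replace("<sent>", sentence)
    inputs = processor(text=prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    # Take the last hidden state of the final prompt token as the embedding (assumed pooling).
    return out.hidden_states[-1][:, -1, :].float()

@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=IMAGE_PROMPT, images=image, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :].float()

# Usage: score a caption against an image with cosine similarity.
t = embed_text("a dog playing in the snow")
v = embed_image(Image.open("dog.jpg"))
print(torch.nn.functional.cosine_similarity(t, v).item())
```

Because both modalities pass through the same language-model backbone and the same "one word" prompt, the abstract's claim follows naturally: fine-tuning the model contrastively on text pairs alone can improve the shared embedding space for images as well, without collecting image-text training data.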
