Jina CLIP: Your CLIP Model Is Also Your Text Retriever

May 30, 2024
Authors: Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
cs.AI

Abstract

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model, achieving state-of-the-art performance on both text-image and text-text retrieval tasks.
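
The core idea of the multi-task training described above is that a text-text and a text-image contrastive objective pull on the same shared text encoder. The sketch below illustrates that idea with a symmetric InfoNCE loss over dummy embeddings; the function names, the temperature, and the equal weighting of the two losses are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a multi-task contrastive objective: a text-text loss
# (queries vs. passages) and a text-image loss (captions vs. images) are
# summed so one shared text encoder serves both retrieval modes.
# All names and the equal loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature      # (batch, batch) cosine-similarity logits
    targets = torch.arange(a.size(0))     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Dummy tensors standing in for encoder outputs in one training step.
batch, dim = 8, 768
queries  = torch.randn(batch, dim, requires_grad=True)  # text encoder: queries
passages = torch.randn(batch, dim, requires_grad=True)  # text encoder: passages
captions = torch.randn(batch, dim, requires_grad=True)  # text encoder: captions
images   = torch.randn(batch, dim, requires_grad=True)  # image encoder: images

# Multi-task objective: both terms backpropagate into the shared text encoder.
loss = info_nce(queries, passages) + info_nce(captions, images)
loss.backward()  # a real training loop would then step an optimizer
```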
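
Because the model embeds texts and images into the same space, a single checkpoint can serve both retrieval modes. Below is a minimal usage sketch; it assumes the Hugging Face checkpoint jinaai/jina-clip-v1 exposes encode_text/encode_image helpers through its custom code (trust_remote_code=True), and the image URL is a placeholder. Consult the model card for the authoritative API.

```python
# Sketch: one model for text-text and text-image retrieval.
# Assumes the jinaai/jina-clip-v1 repo provides encode_text/encode_image
# via trust_remote_code; the image URL below is a hypothetical placeholder.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

text_emb = model.encode_text(["a blue table", "a wooden desk painted blue"])
img_emb = model.encode_image(["https://example.com/blue-table.jpg"])  # placeholder URL

def cos(a, b):
    """Cosine similarity between two fixed-size embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Both vectors live in the same embedding space, so the same similarity
# function covers text-text and text-image retrieval alike.
print(cos(text_emb[0], text_emb[1]))  # text-text similarity
print(cos(text_emb[0], img_emb[0]))   # text-image similarity
```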
