Jina CLIP: Your CLIP Model Is Also Your Text Retriever
May 30, 2024
作者: Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
cs.AI
Abstract
Contrastive Language-Image Pretraining (CLIP) is widely used to train models
to align images and texts in a common embedding space by mapping them to
fixed-sized vectors. These models are key to multimodal information retrieval
and related tasks. However, CLIP models generally underperform in text-only
tasks compared to specialized text models. This creates inefficiencies for
information retrieval systems that keep separate embeddings and models for
text-only and multimodal tasks. We propose a novel, multi-task contrastive
training method to address this issue, which we use to train the jina-clip-v1
model to achieve state-of-the-art performance on both text-image and
text-text retrieval tasks.
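
To make the training recipe concrete, below is a minimal PyTorch sketch of a multi-task contrastive (InfoNCE) objective of the kind the abstract describes: a single shared text tower is optimized jointly on text-text pairs (query/passage) and text-image pairs (caption/image), so one embedding space serves both retrieval settings. The encoder modules, the pair sources, and the `w_text` weighting are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a multi-task contrastive (InfoNCE) objective combining
# text-text and text-image pairs. Names are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions (a -> b and b -> a) and average.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


def multitask_step(text_encoder, image_encoder,
                   queries, passages, captions, images,
                   w_text: float = 0.5) -> torch.Tensor:
    """One training step: weighted sum of a text-text and a text-image loss.

    The same text tower embeds queries, passages, and image captions, which
    is what lets the resulting space handle both text-only and multimodal
    retrieval. The 0.5 task weight is an assumed placeholder.
    """
    loss_tt = info_nce(text_encoder(queries), text_encoder(passages))
    loss_ti = info_nce(text_encoder(captions), image_encoder(images))
    return w_text * loss_tt + (1.0 - w_text) * loss_ti
```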