Jina CLIP: Your CLIP Model Is Also Your Text Retriever
May 30, 2024
作者: Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
cs.AI
Abstract
Contrastive Language-Image Pretraining (CLIP) is widely used to train models
to align images and texts in a common embedding space by mapping them to
fixed-sized vectors. These models are key to multimodal information retrieval
and related tasks. However, CLIP models generally underperform in text-only
tasks compared to specialized text models. This creates inefficiencies for
information retrieval systems that keep separate embeddings and models for
text-only and multimodal tasks. We propose a novel, multi-task contrastive
training method to address this issue, which we use to train the jina-clip-v1
model to achieve state-of-the-art performance on both text-image and
text-text retrieval tasks.
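
To make the training recipe concrete, below is a minimal PyTorch sketch of a multi-task contrastive (InfoNCE) objective of the kind the abstract describes: a single shared text tower is optimized jointly on text-text pairs (query/passage) and text-image pairs (caption/image), so one embedding space serves both retrieval settings. The encoder modules, the pair sources, and the `w_text` weighting are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a multi-task contrastive (InfoNCE) objective combining
# text-text and text-image pairs. Names are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions (a -> b and b -> a) and average.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


def multitask_step(text_encoder, image_encoder,
                   queries, passages, captions, images,
                   w_text: float = 0.5) -> torch.Tensor:
    """One training step: weighted sum of a text-text and a text-image loss.

    The same text tower embeds queries, passages, and image captions, which
    is what lets the resulting space handle both text-only and multimodal
    retrieval. The 0.5 task weight is an assumed placeholder.
    """
    loss_tt = info_nce(text_encoder(queries), text_encoder(passages))
    loss_ti = info_nce(text_encoder(captions), image_encoder(images))
    return w_text * loss_tt + (1.0 - w_text) * loss_ti
```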