Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Abstract

Contrastive Language-Image Pretraining (CLIP) is widely used to train models that align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform specialized text models on text-only tasks. As a result, information retrieval systems must keep separate embeddings and models for text-only and multimodal tasks, which is inefficient. To address this issue, we propose a novel multi-task contrastive training method and use it to train the jina-clip-v1 model, which achieves state-of-the-art performance on both text-image and text-text retrieval tasks.
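The abstract does not spell out the training objective, so the following is only a minimal sketch of what a multi-task contrastive loss could look like, assuming a symmetric InfoNCE loss applied to both text-text pairs and text-image pairs and summed with equal weight. The function names, the temperature of 0.07, and the equal weighting are illustrative assumptions, not details taken from the paper.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to targets[i] against every target in the batch."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)

def symmetric_info_nce(a, b, temperature=0.07):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def multi_task_loss(text_a, text_b, captions, images, temperature=0.07):
    """Hypothetical joint objective: text-text loss plus text-image loss, equally weighted."""
    text_text = symmetric_info_nce(text_a, text_b, temperature)
    text_image = symmetric_info_nce(captions, images, temperature)
    return text_text + text_image

# Toy 2-dimensional embeddings standing in for encoder outputs.
text_a = [[1.0, 0.1], [0.1, 1.0]]
text_b = [[0.9, 0.2], [0.2, 0.9]]
captions = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.8, 0.1], [0.1, 0.8]]
print(multi_task_loss(text_a, text_b, captions, images))
```

In this sketch, a single text encoder would produce `text_a`, `text_b`, and `captions`, so both loss terms shape the same text embedding space. That shared space is what would let one model serve text-text and text-image retrieval alike.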
