Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all of these modalities, and these embeddings generalize well across a wide variety of tasks. By applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks that cover unimodal, cross-modal, and multimodal retrieval over a diverse set of tasks. Our embedding model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as retrieval-augmented generation (RAG), recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
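The scores quoted above are retrieval metrics, so a small illustration of how R@1 and NDCG@10 are commonly computed over embeddings may be useful. The sketch below is not the evaluation code behind the reported numbers. The function names, the toy vectors, and the use of cosine similarity with linear gains are assumptions made for illustration, and no Gemini API is called.

```python
import numpy as np

def cosine_scores(queries, docs):
    """Cosine similarity between every query and every document embedding."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T

def recall_at_k(scores, relevant, k=1):
    """Fraction of queries whose single relevant document is in the top k."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(relevant)[:, None]).any(axis=1)
    return float(hits.mean())

def ndcg_at_k(scores, relevance, k=10):
    """Mean NDCG@k with linear gains (some benchmarks use 2**rel - 1 instead)."""
    order = np.argsort(-scores, axis=1)[:, :k]
    gains = np.take_along_axis(relevance, order, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, gains.shape[1] + 2))
    dcg = (gains * discounts).sum(axis=1)
    ideal = -np.sort(-relevance, axis=1)[:, :k]
    idcg = (ideal * discounts).sum(axis=1)
    safe_idcg = np.where(idcg > 0, idcg, 1.0)
    return float(np.mean(np.where(idcg > 0, dcg / safe_idcg, 0.0)))

# Toy embeddings (made up, not model output): two queries, three documents.
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
scores = cosine_scores(query_emb, doc_emb)

# Query 0 matches document 0, query 1 matches document 1.
relevance = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print("R@1:", recall_at_k(scores, [0, 1], k=1))
print("NDCG@10:", ndcg_at_k(scores, relevance, k=10))
```

In a cross-modal setting such as MSCOCO, the queries and documents would come from different modalities (for example, captions and images) embedded into the same space, and the scoring step stays the same.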