Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across these modalities, and these embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model performs strongly across a variety of tasks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code), surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
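
For readers unfamiliar with the retrieval metrics quoted above, the following is a minimal sketch of how R@1 and NDCG@10 are commonly computed once queries and candidates share one embedding space. Everything in it is an illustrative assumption: the 3-dimensional vectors, the item ids, and the helper names (`rank`, `recall_at_k`, `ndcg_at_k`) are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank(query, corpus):
    """Return corpus ids sorted by descending cosine similarity to the query."""
    return sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded gains (dict: id -> gain); ids not in gains count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (assumed): one text query and three candidates from different
# modalities, all represented as vectors in the same space.
corpus = {
    "img_cat": [0.9, 0.1, 0.0],
    "img_dog": [0.1, 0.9, 0.0],
    "clip_car": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]

ranked = rank(query, corpus)
r1 = recall_at_k(ranked, ["img_cat"], k=1)
ndcg10 = ndcg_at_k(ranked, {"img_cat": 1}, k=10)
```

Benchmark scores such as those quoted in the abstract are averages of per-query values like `r1` and `ndcg10` over a full test set, typically reported on a 0 to 100 scale.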