Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. Leveraging the multimodal capabilities of Gemini, the model produces embeddings for arbitrary combinations of interleaved inputs across all of these modalities, and these embeddings generalize well across a wide variety of tasks. By applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval across a diverse set of tasks. Our model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable out-of-the-box representation, even for specialized domains.
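
For readers unfamiliar with the retrieval metrics quoted above, the following minimal sketch shows how R@1 and NDCG@10 are typically computed once queries and candidates live in one shared embedding space. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the vectors are hand-made toy values rather than model outputs, the function names are hypothetical, similarity is assumed to be cosine, and NDCG uses the common linear-gain formulation. Benchmarks usually report these metrics scaled to 0-100.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing similarity to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_1(queries, candidates, relevant):
    """Fraction of queries whose top-ranked candidate is relevant.

    relevant[q] is the set of candidate indices that are correct for query q.
    """
    hits = sum(
        1
        for q, query in enumerate(queries)
        if rank_candidates(query, candidates)[0] in relevant[q]
    )
    return hits / len(queries)

def ndcg_at_k(ranking, gains, k=10):
    """NDCG@k for one query; gains[i] is the graded relevance of candidate i."""
    dcg = sum(gains[i] / math.log2(pos + 2) for pos, i in enumerate(ranking[:k]))
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (assumed, not model output): two "text" queries and three
# "image" candidates that already share one 3-dimensional space.
queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
candidates = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]
relevant = [{0}, {1}]

print("R@1:", recall_at_1(queries, candidates, relevant))
print("NDCG@10:", ndcg_at_k(rank_candidates(queries[1], candidates), [0, 1, 0]))
```

Because every modality maps into the same space, the same ranking code applies unchanged to text-to-image, text-to-video, or text-to-text retrieval; only the source of the vectors differs.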