Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

May 26, 2026
Authors: Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini
cs.AI

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary interleaved combinations of these modalities, and the embeddings generalize well across a wide variety of tasks. By applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. Our embedding model performs strongly across these tasks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB Multilingual, and 84.0 on MTEB Code), surpassing specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
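
The abstract reports retrieval quality as R@1 and NDCG@10 over embeddings that share one representation space. As a rough illustration of what those numbers measure, the sketch below ranks toy "image" vectors against a toy "text" query by cosine similarity and scores the ranking. It is not the authors' evaluation code: the 3-dimensional vectors, the relevance labels, and the helper names (`cosine`, `rank_documents`, `recall_at_1`, `ndcg_at_k`) are all assumptions made for the example, and the benchmarks' exact protocols may differ from this binary-relevance formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query, docs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def recall_at_1(ranking, relevant):
    """1.0 if the top-ranked document is relevant, otherwise 0.0."""
    return 1.0 if ranking[0] in relevant else 0.0


def ndcg_at_k(ranking, relevant, k=10):
    """NDCG@k with binary relevance (a document is relevant or it is not)."""
    # Discounted gain of relevant documents found in the top k positions.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc in enumerate(ranking[:k])
        if doc in relevant
    )
    # Best achievable DCG: all relevant documents ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical embeddings: in a unified space, a text query and image
# documents are compared directly, with no modality-specific bridge.
text_query = [0.9, 0.1, 0.0]
image_docs = [
    [0.0, 1.0, 0.0],  # doc 0: unrelated
    [1.0, 0.0, 0.0],  # doc 1: relevant
    [0.5, 0.5, 0.0],  # doc 2: relevant
]
relevant = {1, 2}

ranking = rank_documents(text_query, image_docs)
r1 = recall_at_1(ranking, relevant)
ndcg = ndcg_at_k(ranking, relevant, k=10)
print(ranking, r1, ndcg)
```

A benchmark score such as 62.9 R@1 is this kind of per-query value averaged over all queries and expressed as a percentage. NDCG@10 additionally rewards placing relevant items higher within the top ten, which is why it is commonly used when a query has several correct matches.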