Gemini Embedding: Generalizable Embeddings from Gemini
March 10, 2025
Authors: Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain, Simon Baumgartner, Shahrokh Shahi, Frank Palma Gomez, Sandeep Mariserla, Min Choi, Parashar Shah, Sonam Goenka, Ke Chen, Ye Xia, Koert Chen, Sai Meher Karthik Duddu, Yichang Chen, Trevor Walker, Wenlei Zhou, Rakesh Ghiya, Zach Gleicher, Karan Gill, Zhe Dong, Mojtaba Seyedhosseini, Yunhsuan Sung, Raphael Hoffmann, Tom Duerig
cs.AI
Abstract
In this report, we introduce Gemini Embedding, a state-of-the-art embedding
model leveraging the power of Gemini, Google's most capable large language
model. Capitalizing on Gemini's inherent multilingual and code understanding
capabilities, Gemini Embedding produces highly generalizable embeddings for
text spanning numerous languages and textual modalities. The representations
generated by Gemini Embedding can be precomputed and applied to a variety of
downstream tasks including classification, similarity, clustering, ranking, and
retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark
(MMTEB), which includes over one hundred tasks across 250+ languages, Gemini
Embedding substantially outperforms prior state-of-the-art models,
demonstrating considerable improvements in embedding quality. Achieving
state-of-the-art performance across MMTEB's multilingual, English, and code
benchmarks, our unified model demonstrates strong capabilities across a broad
selection of tasks and surpasses specialized domain-specific models.
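
To illustrate how precomputed embeddings can serve a downstream task such as retrieval, here is a minimal sketch that ranks documents by cosine similarity to a query vector. It does not call Gemini Embedding. The three-dimensional vectors, the document identifiers, and the helper names `cosine_similarity` and `rank_documents` are hypothetical stand-ins chosen for illustration, and cosine similarity is assumed here as the scoring function rather than taken from the report.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings computed ahead of time
# (hypothetical values, not produced by any real embedding model).
PRECOMPUTED_DOCS = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}

# A toy query vector; in practice it would come from the same embedding model.
query = [0.9, 0.1, 0.0]
ranking = rank_documents(query, PRECOMPUTED_DOCS)

for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because the document vectors are stored once and reused, only the query needs to be embedded at request time. The same stored vectors could also feed a classifier or a clustering routine.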