EmbeddingGemma: Powerful and Lightweight Text Representations

September 24, 2025
Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
cs.AI

Abstract

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
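To make two of the abstract's techniques concrete, below is a minimal PyTorch-style sketch of a spread-out regularizer (in the spirit of Zhang et al., 2017) and Matryoshka-style embedding truncation. This is an illustrative assumption, not the authors' released training code; all function names and penalty terms are hypothetical.

```python
# Illustrative sketch only -- not EmbeddingGemma's actual implementation.
import torch
import torch.nn.functional as F

def spread_out_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """Spread-out-style regularizer: pairwise similarities of
    non-matching embeddings should resemble those of points drawn
    uniformly on the unit sphere (mean ~ 0, second moment ~ 1/d)."""
    z = F.normalize(embeddings, dim=-1)               # (batch, dim)
    sim = z @ z.T                                     # cosine similarities
    n, d = z.shape
    mask = ~torch.eye(n, dtype=torch.bool, device=z.device)
    off_diag = sim[mask]                              # non-matching pairs
    mean_penalty = off_diag.mean().pow(2)
    moment_penalty = (off_diag.pow(2).mean() - 1.0 / d).clamp(min=0.0)
    return mean_penalty + moment_penalty

def truncate_embedding(z: torch.Tensor, k: int) -> torch.Tensor:
    """Matryoshka-style truncation: keep the first k dimensions and
    re-normalize, so shorter embeddings remain usable for retrieval."""
    return F.normalize(z[..., :k], dim=-1)
```

The 1/d target for the second moment matches the statistics of unit vectors drawn uniformly from the sphere, which is why a penalty of this form encourages embeddings to spread out; truncation followed by re-normalization is the mechanism that lets a model retain quality when embedding outputs are shortened.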
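Since the model is released to the community, a natural way to try it is through the sentence-transformers library. A minimal usage sketch, assuming the checkpoint is published under an id like google/embeddinggemma-300m (the id is an assumption, not confirmed by this page):

```python
# Hypothetical usage sketch; the checkpoint id is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
docs = [
    "EmbeddingGemma is a 300M-parameter text embedding model.",
    "It targets on-device, low-latency retrieval.",
]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, embedding_dim)
```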