
EmbeddingGemma: Powerful and Lightweight Text Representations

September 24, 2025
Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
cs.AI

Abstract

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
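The abstract notes that EmbeddingGemma's quality holds up even when its embedding outputs are truncated, which suggests a Matryoshka-style training of the representation. Below is a minimal, hypothetical sketch of how truncated embeddings could be used for retrieval through the sentence-transformers library; the model ID "google/embeddinggemma-300m" and the 256-dimension truncation are illustrative assumptions, not usage taken from the paper.

```python
# Hypothetical sketch: retrieval with truncated EmbeddingGemma embeddings.
# Assumes sentence-transformers >= 2.7 (for truncate_dim) and that the
# model is published under the ID below; verify against the official release.
from sentence_transformers import SentenceTransformer
import numpy as np

# truncate_dim keeps only the first 256 dimensions of each embedding,
# trading a small amount of quality for lower storage and faster search.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

docs = [
    "EmbeddingGemma is a lightweight open text embedding model.",
    "The weather in Zurich is mild in late September.",
]
query = "small open-source embedding models"

doc_emb = model.encode(docs)     # shape: (2, 256)
query_emb = model.encode(query)  # shape: (256,)

# Cosine similarity between the query and each document.
scores = doc_emb @ query_emb / (
    np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb)
)
print(scores)  # higher score = more relevant document
```

Truncation of this kind is useful on-device because a smaller index cuts memory and lookup latency roughly in proportion to the retained dimensionality, which is consistent with the low-latency, high-throughput use cases the abstract highlights.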