
EmbeddingGemma: Powerful and Lightweight Text Representations

September 24, 2025
Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
cs.AI

Abstract

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
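The abstract notes that EmbeddingGemma's quality holds up even when its embedding outputs are truncated, which suggests a Matryoshka-style training of the representation. Below is a minimal, hypothetical sketch of how truncated embeddings could be used for retrieval through the sentence-transformers library; the model ID "google/embeddinggemma-300m" and the 256-dimension truncation are illustrative assumptions, not usage taken from the paper.

```python
# Hypothetical sketch: retrieval with truncated EmbeddingGemma embeddings.
# Assumes sentence-transformers >= 2.7 (for truncate_dim) and that the
# model is published under the ID below; verify against the official release.
from sentence_transformers import SentenceTransformer
import numpy as np

# truncate_dim keeps only the first 256 dimensions of each embedding,
# trading a small amount of quality for lower storage and faster search.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

docs = [
    "EmbeddingGemma is a lightweight open text embedding model.",
    "The weather in Zurich is mild in late September.",
]
query = "small open-source embedding models"

doc_emb = model.encode(docs)     # shape: (2, 256)
query_emb = model.encode(query)  # shape: (256,)

# Cosine similarity between the query and each document.
scores = doc_emb @ query_emb / (
    np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb)
)
print(scores)  # higher score = more relevant document
```

Truncation of this kind is useful on-device because a smaller index cuts memory and lookup latency roughly in proportion to the retained dimensionality, which is consistent with the low-latency, high-throughput use cases the abstract highlights.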