EmbeddingGemma: Powerful and Lightweight Text Representations

September 24, 2025
Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
cs.AI

Abstract

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
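To make two of the abstract's techniques concrete, below is a minimal PyTorch-style sketch of a spread-out regularizer (in the spirit of Zhang et al., 2017) and Matryoshka-style embedding truncation. This is an illustrative assumption, not the authors' released training code; all function names and penalty terms are hypothetical.

```python
# Illustrative sketch only -- not EmbeddingGemma's actual implementation.
import torch
import torch.nn.functional as F

def spread_out_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """Spread-out-style regularizer: pairwise similarities of
    non-matching embeddings should resemble those of points drawn
    uniformly on the unit sphere (mean ~ 0, second moment ~ 1/d)."""
    z = F.normalize(embeddings, dim=-1)               # (batch, dim)
    sim = z @ z.T                                     # cosine similarities
    n, d = z.shape
    mask = ~torch.eye(n, dtype=torch.bool, device=z.device)
    off_diag = sim[mask]                              # non-matching pairs
    mean_penalty = off_diag.mean().pow(2)
    moment_penalty = (off_diag.pow(2).mean() - 1.0 / d).clamp(min=0.0)
    return mean_penalty + moment_penalty

def truncate_embedding(z: torch.Tensor, k: int) -> torch.Tensor:
    """Matryoshka-style truncation: keep the first k dimensions and
    re-normalize, so shorter embeddings remain usable for retrieval."""
    return F.normalize(z[..., :k], dim=-1)
```

The 1/d target for the second moment matches the statistics of unit vectors drawn uniformly from the sphere, which is why a penalty of this form encourages embeddings to spread out; truncation followed by re-normalization is the mechanism that lets a model retain quality when embedding outputs are shortened.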
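Since the model is released to the community, a natural way to try it is through the sentence-transformers library. A minimal usage sketch, assuming the checkpoint is published under an id like google/embeddinggemma-300m (the id is an assumption, not confirmed by this page):

```python
# Hypothetical usage sketch; the checkpoint id is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
docs = [
    "EmbeddingGemma is a 300M-parameter text embedding model.",
    "It targets on-device, low-latency retrieval.",
]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, embedding_dim)
```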