jina-embeddings-v5-text: Task-Targeted Embedding Distillation
February 17, 2026
Authors: Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, Han Xiao
cs.AI
Abstract
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained in single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation with a task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach trains small models more effectively than purely contrastive or purely distillation-based paradigms. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, match or exceed the state of the art for models of similar size. The jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. The model weights are publicly released; we hope they inspire further advances in embedding model development.
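The core idea of combining a distillation objective with a task-specific contrastive objective can be sketched as a weighted sum of two loss terms. The following PyTorch sketch is illustrative only: the cosine-distance distillation term, the in-batch InfoNCE contrastive term, and the mixing weight `alpha` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(student_q, student_p, teacher_q, temperature=0.05, alpha=0.5):
    """Hypothetical objective mixing distillation with contrastive learning.

    student_q: (B, D) student embeddings of the queries
    student_p: (B, D) student embeddings of the matching positives
    teacher_q: (B, D) teacher embeddings of the queries (assumed already
               projected to the student's dimensionality)
    """
    # Distillation term: pull student embeddings toward the teacher's
    # via mean cosine distance.
    distill = 1.0 - F.cosine_similarity(student_q, teacher_q, dim=-1).mean()

    # Contrastive term (InfoNCE with in-batch negatives): each query should
    # rank its own positive above every other positive in the batch.
    q = F.normalize(student_q, dim=-1)
    p = F.normalize(student_p, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries match
    contrastive = F.cross_entropy(logits, labels)

    return alpha * distill + (1.0 - alpha) * contrastive
```

A fixed `alpha` is the simplest choice; a schedule that shifts weight from distillation to the contrastive term over training is an equally plausible variant, and the abstract does not specify which is used.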
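The robustness claim under truncation and binary quantization matters for storage and retrieval cost downstream. A minimal NumPy sketch of how such embeddings are typically consumed, assuming Matryoshka-style prefix truncation and sign-based binarization (the target dimension 256 and the thresholding scheme are assumptions, not the paper's specification):

```python
import numpy as np

def truncate_and_binarize(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` dimensions (Matryoshka-style truncation),
    renormalize, and binarize by sign. `dim=256` is an arbitrary example."""
    truncated = embeddings[:, :dim]
    truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
    return (truncated > 0).astype(np.uint8)  # one bit per dimension

# Binary codes can then be packed and compared via cheap Hamming distance:
# packed = np.packbits(codes, axis=1)
# hamming = np.unpackbits(packed[0] ^ packed[1]).sum()
```

Sign binarization preserves the angular neighborhood structure of normalized embeddings well enough for coarse retrieval, which is the property the abstract's robustness claim speaks to.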