
jina-embeddings-v5-text: Task-Targeted Embedding Distillation

February 17, 2026
Authors: Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, Han Xiao
cs.AI

Abstract

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
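The abstract describes combining a distillation objective with a task-specific contrastive loss. As a rough illustration of that idea (not the authors' actual recipe: the loss forms, the `alpha` weighting, and all dimensions below are assumptions for the example), a student's in-batch InfoNCE loss can be mixed with a mean-squared-error term that pulls its embeddings toward a teacher's:

```python
import numpy as np

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the doc at the same index."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature                    # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))        # NLL of the matching pairs

def distill_loss(student, teacher):
    """Distillation term: mean squared error against teacher embeddings
    (assumes student and teacher dimensions already match)."""
    return float(np.mean((student - teacher) ** 2))

def combined_loss(student_q, student_d, teacher_q, teacher_d, alpha=0.5):
    """Hypothetical weighted mix of the two objectives; `alpha` is illustrative."""
    contrastive = info_nce_loss(student_q, student_d)
    distill = distill_loss(student_q, teacher_q) + distill_loss(student_d, teacher_d)
    return alpha * contrastive + (1 - alpha) * distill

# Toy batch of 8 query/document pairs with 32-dim student embeddings,
# and a synthetic "teacher" that is a slightly perturbed copy.
rng = np.random.default_rng(0)
B, D = 8, 32
sq = rng.standard_normal((B, D))
sd = rng.standard_normal((B, D))
tq = sq + 0.1 * rng.standard_normal((B, D))
td = sd + 0.1 * rng.standard_normal((B, D))
loss = combined_loss(sq, sd, tq, td)
print(loss)
```

In a real training setup the two terms would be computed on model outputs and backpropagated jointly; the sketch only shows how a single scalar objective can blend task-specific contrastive supervision with teacher matching.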