

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

September 16, 2024
Authors: Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
cs.AI

Abstract

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
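The flexible-truncation property from Matryoshka Representation Learning means a consumer can keep only the leading components of each embedding and re-normalize, trading dimensionality for storage at a small accuracy cost. A minimal NumPy sketch of that post-processing step (the `truncate_embeddings` helper and the random vectors standing in for model output are illustrative, not part of the released model API):

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize.

    Under Matryoshka Representation Learning, the leading components
    carry the most information, so truncation degrades quality gracefully.
    """
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for degenerate (all-zero) rows.
    return truncated / np.clip(norms, 1e-12, None)

# Random vectors standing in for full-size model embeddings.
full = np.random.default_rng(0).normal(size=(4, 1024)).astype(np.float32)
small = truncate_embeddings(full, 256)  # 4 embeddings, now 256-dim, unit-norm
```

Because the truncated vectors are re-normalized, cosine similarity can be computed on them directly with a plain dot product.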

