NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
May 27, 2024
Authors: Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI
Abstract
Decoder-only large language model (LLM)-based embedding models are beginning
to outperform BERT or T5-based embedding models in general-purpose text
embedding tasks, including dense vector-based retrieval. In this work, we
introduce the NV-Embed model with a variety of architectural designs and
training procedures to significantly enhance the performance of LLMs as
versatile embedding models, while maintaining their simplicity and
reproducibility. For model architecture, we propose a latent attention layer to
obtain pooled embeddings, which consistently improves retrieval and downstream
task accuracy compared to mean pooling or using the last <EOS> token embedding
from LLMs. To enhance representation learning, we remove the causal attention
mask of LLMs during contrastive training. For model training, we introduce a
two-stage contrastive instruction-tuning method. It first applies contrastive
training with instructions on retrieval datasets, utilizing in-batch negatives
and curated hard negative examples. At stage-2, it blends various non-retrieval
datasets into instruction tuning, which not only enhances non-retrieval task
accuracy but also improves retrieval performance. Combining these techniques,
our NV-Embed model, using only publicly available data, has achieved a
record-high score of 69.32, ranking No. 1 on the Massive Text Embedding
Benchmark (MTEB) (as of May 24, 2024), with 56 tasks, encompassing retrieval,
reranking, classification, clustering, and semantic textual similarity tasks.
Notably, our model also attains the highest score of 59.36 on 15 retrieval
tasks in the MTEB benchmark (also known as BEIR). We will open-source the model
at: https://huggingface.co/nvidia/NV-Embed-v1.
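
To make the pooling idea in the abstract concrete, below is a minimal sketch of a latent-attention pooling head: the decoder's last-layer token hidden states attend to a small set of trainable latent vectors, and the result is mean-pooled into a single embedding. The class name, the latent count (512), the trailing GELU MLP, and the final L2 normalization are illustrative assumptions, not the released NV-Embed configuration. The abstract also mentions removing the causal attention mask during contrastive training; that change happens inside the decoder itself and is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttentionPooling(nn.Module):
    """Sketch of a latent-attention pooling head (dimensions are assumptions)."""

    def __init__(self, hidden_dim: int, num_latents: int = 512):
        super().__init__()
        # Trainable latent array acting as keys/values in cross-attention.
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim))
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the decoder's last layer.
        # Queries are the token hidden states; keys/values are the latents.
        scores = hidden_states @ self.latents.T / self.latents.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)            # (batch, seq_len, num_latents)
        mixed = self.mlp(attn @ self.latents)       # (batch, seq_len, hidden_dim)
        # Mean-pool over non-padding tokens to get one embedding per sequence.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (mixed * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(pooled, dim=-1)
```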
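The first training stage combines in-batch negatives with curated hard negatives under a contrastive objective. The following InfoNCE-style loss is one plausible reading of that setup; the function name, tensor layout, and temperature value of 0.05 are assumptions for illustration, not reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """Illustrative contrastive loss with in-batch and curated hard negatives.

    q_emb, pos_emb: (batch, dim) L2-normalized query / positive embeddings.
    hard_neg_emb:   (batch, num_hard, dim) normalized hard-negative embeddings.
    """
    batch = q_emb.shape[0]
    # In-batch negatives: every other positive in the batch serves as a negative.
    in_batch_scores = q_emb @ pos_emb.T                             # (batch, batch)
    # Curated hard negatives mined per query.
    hard_scores = torch.einsum("bd,bnd->bn", q_emb, hard_neg_emb)   # (batch, num_hard)
    logits = torch.cat([in_batch_scores, hard_scores], dim=1) / temperature
    labels = torch.arange(batch, device=q_emb.device)               # diagonal = positives
    return F.cross_entropy(logits, labels)
```

In the second stage described by the abstract, non-retrieval datasets are blended into instruction tuning with the same embedding model; the abstract does not specify further loss changes, so none are assumed here.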