NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Abstract

Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT- or T5-based embedding models on general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model, which combines a variety of architectural designs and training procedures to significantly enhance the performance of LLMs as versatile embedding models while maintaining simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For model training, we introduce a two-stage contrastive instruction-tuning method. The first stage applies contrastive training with instructions on retrieval datasets, using in-batch negatives and curated hard negative examples. The second stage blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. Combining these techniques, our NV-Embed model, trained using only publicly available data, achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) as of May 24, 2024. MTEB comprises 56 tasks spanning retrieval, reranking, classification, clustering, and semantic textual similarity. Notably, our model also attains the highest score of 59.36 on the 15 retrieval tasks in the MTEB benchmark (also known as BEIR). We will open-source the model at: https://huggingface.co/nvidia/NV-Embed-v1.
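The first-stage objective described above, contrastive training with in-batch negatives and hard negatives, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the temperature value of 0.05 are assumptions chosen for illustration, and the vectors stand in for embeddings that a real model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE-style loss over a batch (illustrative sketch).

    For query i the candidate set is:
      - its own positive (the target),
      - the positives of the other queries (in-batch negatives),
      - its own hard negatives.
    The similarity function and temperature are assumptions, not values
    taken from the abstract.
    """
    total = 0.0
    for i, q in enumerate(queries):
        candidates = list(positives) + list(hard_negatives[i])
        logits = [cosine(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-softmax at the index of the true positive.
        total += log_z - logits[i]
    return total / len(queries)


# Toy batch of two queries with 2-d "embeddings" (hypothetical data).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[[0.6, 0.8]], [[0.8, 0.6]]]
loss = contrastive_loss(queries, positives, hard_negatives)
```

The loss is small when each query is closest to its own positive and grows when a hard negative or another query's positive scores higher, which is the pressure that pushes the embedding space toward retrieval-friendly geometry.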
