LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Abstract

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
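Two of the three steps named above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the helper names (`attention_mask`, `cosine`, `info_nce_loss`), the temperature value, and the choice of an InfoNCE-style loss with in-batch negatives for the contrastive step are all assumptions made for illustration. Masked next token prediction is omitted because the abstract does not describe its mechanics.

```python
import math


def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len matrix; mask[i][j] is True if position i may attend to j.

    A decoder-only model normally uses the causal (lower-triangular) mask.
    Enabling bidirectional attention amounts to allowing every pair (i, j).
    """
    return [[bidirectional or j <= i for j in range(seq_len)] for i in range(seq_len)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed objective, hypothetical helper).

    anchors[i] should be close to positives[i]; every other positive in the
    batch serves as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Step 1: causal mask versus the fully bidirectional mask.
causal = attention_mask(3, bidirectional=False)
full = attention_mask(3, bidirectional=True)

# Step 3: correctly paired embeddings give a lower loss than mismatched ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the two "views" of each text are just hand-written vectors; in an unsupervised setting they would typically come from two encodings of the same input, which is again an assumption rather than something the abstract specifies.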
