LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
April 9, 2024
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
cs.AI
Abstract
Large decoder-only language models (LLMs) are the state-of-the-art models on
most of today's NLP tasks and benchmarks. Yet, the community is only slowly
adopting these models for text embedding tasks, which require rich
contextualized representations. In this work, we introduce LLM2Vec, a simple
unsupervised approach that can transform any decoder-only LLM into a strong
text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional
attention, 2) masked next token prediction, and 3) unsupervised contrastive
learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3
popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed
models on English word- and sequence-level tasks. We outperform encoder-only
models by a large margin on word-level tasks and reach a new unsupervised
state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).
Moreover, when combining LLM2Vec with supervised contrastive learning, we
achieve state-of-the-art performance on MTEB among models that train only on
publicly available data. Our strong empirical results and extensive analysis
demonstrate that LLMs can be effectively transformed into universal text
encoders in a parameter-efficient manner without the need for expensive
adaptation or synthetic GPT-4 generated data.
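The three steps named in the abstract can be illustrated with a minimal conceptual sketch. This is not the authors' implementation; the function names are illustrative, and the "hidden states" are random stand-ins for a real LLM's outputs. It only shows the core ideas: replacing the causal attention mask with a bidirectional one (step 1), masking tokens so the model must use surrounding context to predict them (step 2), and scoring embedding similarity as a contrastive objective would (step 3).

```python
import numpy as np


def attention_mask(seq_len, bidirectional=False):
    """Step 1: a decoder-only LLM normally uses a causal (lower-triangular)
    mask; enabling bidirectional attention replaces it with an all-ones mask
    so every token can attend to every other token."""
    if bidirectional:
        return np.ones((seq_len, seq_len))
    return np.tril(np.ones((seq_len, seq_len)))


def mask_tokens(token_ids, positions, mask_id):
    """Step 2 (masked next token prediction): hide tokens at the given
    positions; the model is trained to recover them from bidirectional
    context. `mask_id` is a placeholder mask-token id."""
    masked = list(token_ids)
    for p in positions:
        masked[p] = mask_id
    return masked


def mean_pool(hidden_states):
    """Pool per-token hidden states (seq_len x dim) into one text embedding."""
    return hidden_states.mean(axis=0)


def cosine_sim(a, b):
    """Step 3 (unsupervised contrastive learning, SimCSE-style): two views of
    the same text should score higher than views of different texts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy demonstration with random stand-in hidden states (dim=4).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))
embedding = mean_pool(hidden)

causal = attention_mask(3)                       # token 0 cannot see token 2
bidir = attention_mask(3, bidirectional=True)    # every token sees every token
masked = mask_tokens([11, 12, 13, 14], positions=[1], mask_id=0)
```

In the paper's actual recipe these steps are applied to a real decoder-only LLM with parameter-efficient fine-tuning; the sketch above only mirrors the structure of the pipeline, not its training loop.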