LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
April 9, 2024
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
cs.AI
Abstract
Large decoder-only language models (LLMs) are the state-of-the-art models on
most of today's NLP tasks and benchmarks. Yet, the community is only slowly
adopting these models for text embedding tasks, which require rich
contextualized representations. In this work, we introduce LLM2Vec, a simple
unsupervised approach that can transform any decoder-only LLM into a strong
text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional
attention, 2) masked next token prediction, and 3) unsupervised contrastive
learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3
popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed
models on English word- and sequence-level tasks. We outperform encoder-only
models by a large margin on word-level tasks and reach a new unsupervised
state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).
Moreover, when combining LLM2Vec with supervised contrastive learning, we
achieve state-of-the-art performance on MTEB among models that train only on
publicly available data. Our strong empirical results and extensive analysis
demonstrate that LLMs can be effectively transformed into universal text
encoders in a parameter-efficient manner without the need for expensive
adaptation or synthetic GPT-4 generated data.
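The three steps named in the abstract can be illustrated with a minimal conceptual sketch. This is not the authors' implementation; the function names are illustrative, and the "hidden states" are random stand-ins for a real LLM's outputs. It only shows the core ideas: replacing the causal attention mask with a bidirectional one (step 1), masking tokens so the model must use surrounding context to predict them (step 2), and scoring embedding similarity as a contrastive objective would (step 3).

```python
import numpy as np


def attention_mask(seq_len, bidirectional=False):
    """Step 1: a decoder-only LLM normally uses a causal (lower-triangular)
    mask; enabling bidirectional attention replaces it with an all-ones mask
    so every token can attend to every other token."""
    if bidirectional:
        return np.ones((seq_len, seq_len))
    return np.tril(np.ones((seq_len, seq_len)))


def mask_tokens(token_ids, positions, mask_id):
    """Step 2 (masked next token prediction): hide tokens at the given
    positions; the model is trained to recover them from bidirectional
    context. `mask_id` is a placeholder mask-token id."""
    masked = list(token_ids)
    for p in positions:
        masked[p] = mask_id
    return masked


def mean_pool(hidden_states):
    """Pool per-token hidden states (seq_len x dim) into one text embedding."""
    return hidden_states.mean(axis=0)


def cosine_sim(a, b):
    """Step 3 (unsupervised contrastive learning, SimCSE-style): two views of
    the same text should score higher than views of different texts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy demonstration with random stand-in hidden states (dim=4).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))
embedding = mean_pool(hidden)

causal = attention_mask(3)                       # token 0 cannot see token 2
bidir = attention_mask(3, bidirectional=True)    # every token sees every token
masked = mask_tokens([11, 12, 13, 14], positions=[1], mask_id=0)
```

In the paper's actual recipe these steps are applied to a real decoder-only LLM with parameter-efficient fine-tuning; the sketch above only mirrors the structure of the pipeline, not its training loop.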