LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

April 9, 2024
作者: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
cs.AI

Abstract

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
