LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Abstract

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to three popular LLMs ranging from 1.3B to 7B parameters and evaluating the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner, without the need for expensive adaptation or synthetic GPT-4-generated data.
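As a rough illustration of steps 1 and 3, the sketch below contrasts a causal attention mask with a bidirectional one and computes a generic in-batch contrastive (InfoNCE-style) loss. It is a toy, standard-library-only sketch, not the authors' implementation: the function names (`attention_mask`, `attention_weights`, `info_nce_loss`) and the temperature value are hypothetical, and the abstract does not specify how positive pairs are constructed or which contrastive objective is used. Step 2 (masked next token prediction) is not sketched here.

```python
import math


def attention_mask(seq_len, bidirectional):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (decoder-style): only j <= i is allowed.
    Bidirectional (encoder-style): every position is allowed.
    """
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, mask):
    """Row-wise softmax of raw attention scores; masked-out keys get weight 0."""
    return [
        softmax([s if allowed else -math.inf for s, allowed in zip(row, mask_row)])
        for row, mask_row in zip(scores, mask)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Generic in-batch contrastive loss (an assumption, not the paper's exact objective).

    positives[i] is the positive for anchors[i]; every other row in the batch
    acts as a negative. The temperature is an illustrative value.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        losses.append(-math.log(softmax(logits)[i]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    # Under the causal mask the first token can only attend to itself.
    print(attention_weights(uniform_scores, attention_mask(3, False))[0])
    # Under the bidirectional mask it attends to the whole sequence.
    print(attention_weights(uniform_scores, attention_mask(3, True))[0])
```

With the causal mask, the first token's representation cannot depend on later tokens; with the bidirectional mask, every token can draw on the full sequence, which is the property text embedding tasks rely on.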
