Making Text Embedders Few-Shot Learners

Abstract

Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-context learning (ICL) capabilities. This feature enables them to handle both familiar and novel tasks effectively by using examples provided within their input context. Recognizing the potential of this capability, we propose leveraging the ICL feature in LLMs to enhance text embedding generation. To this end, we introduce a novel model, bge-en-icl, which uses few-shot examples to produce high-quality text embeddings. Our approach integrates task-related examples directly into the query side, resulting in significant improvements across various tasks. Additionally, we investigate how to use LLMs effectively as embedding models, including the choice of attention mechanism, pooling method, and related design decisions. Our findings suggest that retaining the original framework often yields the best results, underscoring that simplicity is best. Experimental results on the MTEB and AIR-Bench benchmarks demonstrate that our approach achieves new state-of-the-art (SOTA) performance. Our model, code, and dataset are freely available at https://github.com/FlagOpen/FlagEmbedding.
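The idea of integrating task-related examples into the query side can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Example` type, the `build_icl_query` and `build_document` helpers, and the `Task:` / `Query:` / `Response:` layout are hypothetical and are not the template used by bge-en-icl. The sketch only shows the general shape: a few demonstrations are prepended to the query text before it is embedded, while documents are left unchanged.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One in-context demonstration: a sample query and a matching response."""
    query: str
    response: str


def build_icl_query(task: str, examples: Sequence[Example], query: str) -> str:
    """Prepend task-related demonstrations to the query text.

    The field labels and separators are hypothetical placeholders,
    not the actual bge-en-icl prompt format.
    """
    blocks = [
        f"Task: {task}\nQuery: {ex.query}\nResponse: {ex.response}"
        for ex in examples
    ]
    # The target query comes last and carries no response.
    blocks.append(f"Task: {task}\nQuery: {query}")
    return "\n\n".join(blocks)


def build_document(text: str) -> str:
    """Documents receive no examples; only the query side is augmented."""
    return text


if __name__ == "__main__":
    demos = [
        Example(
            query="what is in-context learning",
            response="In-context learning lets a model use examples in its prompt.",
        ),
    ]
    print(build_icl_query("Retrieve passages that answer the question.",
                          demos, "how are text embeddings pooled"))
```

With an empty `examples` sequence the helper falls back to a plain zero-shot query, so the same function covers both the zero-shot and the few-shot setting. The resulting string would then be passed to whatever embedding model is in use.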
