Making Text Embedders Few-Shot Learners
September 24, 2024
Authors: Chaofan Li, MingHao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu
cs.AI
Abstract
Large language models (LLMs) with decoder-only architectures demonstrate
remarkable in-context learning (ICL) capabilities. This feature enables them to
effectively handle both familiar and novel tasks by utilizing examples provided
within their input context. Recognizing the potential of this capability, we
propose leveraging the ICL feature in LLMs to enhance the process of text
embedding generation. To this end, we introduce a novel model bge-en-icl, which
employs few-shot examples to produce high-quality text embeddings. Our approach
integrates task-related examples directly into the query side, resulting in
significant improvements across various tasks. Additionally, we have
investigated how to effectively utilize LLMs as embedding models, including
various attention mechanisms, pooling methods, etc. Our findings suggest that
retaining the original framework often yields the best results, underscoring
that simplicity is best. Experimental results on the MTEB and AIR-Bench
benchmarks demonstrate that our approach sets new state-of-the-art (SOTA)
performance. Our model, code and dataset are freely available at
https://github.com/FlagOpen/FlagEmbedding.
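The abstract's central idea is that task-related few-shot examples are prepended on the query side, and the embedding is then read from an otherwise unmodified decoder-only LLM. The sketch below illustrates that pattern with the Hugging Face `transformers` library; the prompt template, example format, and last-token pooling choice are illustrative assumptions rather than the exact bge-en-icl recipe.

```python
# Minimal sketch: query-side in-context learning for text embeddings.
# The prompt format below is an assumption for illustration, not the
# official bge-en-icl template.
import torch
from transformers import AutoModel, AutoTokenizer


def build_icl_query(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Prepend a task instruction and few-shot (query, response) examples to the query."""
    parts = [f"Instruct: {task}"]
    for ex_query, ex_response in examples:
        parts.append(f"Query: {ex_query}\nResponse: {ex_response}")
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)


@torch.no_grad()
def embed(texts: list[str], model, tokenizer) -> torch.Tensor:
    """Encode texts with a decoder-only LM and pool the last non-padding token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state               # [batch, seq, dim]
    last_idx = batch["attention_mask"].sum(dim=1) - 1       # index of last real token
    emb = hidden[torch.arange(hidden.size(0)), last_idx]    # last-token pooling
    return torch.nn.functional.normalize(emb, dim=-1)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")
    tokenizer.padding_side = "right"          # last-token pooling above assumes right padding
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModel.from_pretrained("BAAI/bge-en-icl").eval()

    examples = [("what is a decoder-only LLM?",
                 "A language model built only from the transformer decoder stack.")]
    query = build_icl_query("Given a question, retrieve relevant passages.",
                            examples,
                            "how does in-context learning work?")
    passages = ["In-context learning lets an LLM adapt from examples in its prompt."]

    q_emb = embed([query], model, tokenizer)
    p_emb = embed(passages, model, tokenizer)
    print(q_emb @ p_emb.T)                    # cosine similarity between query and passages
```

Note that only the query carries the few-shot examples; candidate passages are embedded as plain text, which matches the abstract's description of integrating task-related examples "directly into the query side" while keeping the original model framework unchanged.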