Gecko: Versatile Text Embeddings Distilled from Large Language Models
March 29, 2024
Authors: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim
cs.AI
Abstract
We present Gecko, a compact and versatile text embedding model. Gecko
achieves strong retrieval performance by leveraging a key idea: distilling
knowledge from large language models (LLMs) into a retriever. Our two-step
distillation process begins with generating diverse, synthetic paired data
using an LLM. Next, we further refine the data quality by retrieving a set of
candidate passages for each query, and relabeling the positive and hard
negative passages using the same LLM. The effectiveness of our approach is
demonstrated by the compactness of Gecko. On the Massive Text Embedding
Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing
entries with 768-dimensional embeddings. Gecko with 768 embedding dimensions achieves
an average score of 66.31, competing with 7x larger models and 5x higher
dimensional embeddings.
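The second distillation step described above (retrieve candidates for each synthetic query, then let the LLM pick the positive and a hard negative) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_relevance_score` is a hypothetical stand-in that uses toy term overlap where Gecko would query the LLM for a relevance judgment, and retrieval of the candidate set is assumed to have already happened.

```python
from typing import List, Tuple

def llm_relevance_score(query: str, passage: str) -> float:
    """Toy relevance proxy (term overlap). In Gecko, this role is
    played by the same LLM that generated the synthetic query."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def relabel(query: str, candidates: List[str]) -> Tuple[str, str]:
    """Re-rank retrieved candidate passages with the LLM scorer.
    The top-ranked passage becomes the positive example; the
    bottom-ranked one serves as a hard negative for training."""
    ranked = sorted(candidates,
                    key=lambda p: llm_relevance_score(query, p),
                    reverse=True)
    return ranked[0], ranked[-1]

# Assumed pre-retrieved candidate set for one synthetic query.
candidates = [
    "gecko is a compact text embedding model",
    "the weather today is sunny and warm",
    "embedding models map text to vectors",
]
pos, neg = relabel("compact text embedding model", candidates)
```

Note that the relabeled positive need not be the passage the query was originally generated from; letting the LLM re-select it is what refines the data quality.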