

Gecko: Versatile Text Embeddings Distilled from Large Language Models

March 29, 2024
作者: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim
cs.AI

Abstract

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding dimensions. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.
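The two-step distillation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`llm_generate_task_and_query`, `llm_score`) and the word-overlap scorer are stand-ins for the LLM prompting and the retriever used in the actual pipeline.

```python
# Hedged sketch of the two-step distillation pipeline described in the
# abstract. All helper functions below are hypothetical stand-ins for
# LLM calls; the paper's actual prompts and models are not shown here.

def llm_generate_task_and_query(passage):
    # Step 1 (stand-in): the LLM reads a seed passage and emits a
    # (task description, query) pair, producing diverse synthetic data.
    return ("question answering", f"what is stated here: {passage}")

def llm_score(query, passage):
    # Stand-in relevance scorer (word overlap); in the described method
    # the same LLM judges the relevance of each candidate passage.
    return len(set(query.split()) & set(passage.split()))

def distill(corpus, retrieve_top_k=3):
    """Build (task, query, positive, hard_negative) training tuples."""
    training_examples = []
    for seed in corpus:
        task, query = llm_generate_task_and_query(seed)
        # Step 2a: retrieve a set of candidate passages for the query.
        candidates = sorted(corpus, key=lambda p: llm_score(query, p),
                            reverse=True)[:retrieve_top_k]
        # Step 2b: relabel with the LLM's scores; the top candidate
        # becomes the positive (it may differ from the seed passage),
        # and a lower-ranked candidate becomes the hard negative.
        positive, hard_negative = candidates[0], candidates[-1]
        training_examples.append((task, query, positive, hard_negative))
    return training_examples
```

The relabeling step is the crux: because the positive is chosen by the LLM from retrieved candidates rather than assumed to be the seed passage, noisy synthetic pairs get corrected before training the retriever.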
