Gecko: Versatile Text Embeddings Distilled from Large Language Models
March 29, 2024
作者: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim
cs.AI
Abstract
We present Gecko, a compact and versatile text embedding model. Gecko
achieves strong retrieval performance by leveraging a key idea: distilling
knowledge from large language models (LLMs) into a retriever. Our two-step
distillation process begins with generating diverse, synthetic paired data
using an LLM. Next, we further refine the data quality by retrieving a set of
candidate passages for each query, and relabeling the positive and hard
negative passages using the same LLM. The effectiveness of our approach is
demonstrated by the compactness of Gecko. On the Massive Text Embedding
Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing
entries with 768 embedding dimensions. Gecko with 768 embedding dimensions achieves
an average score of 66.31, competing with 7x larger models and 5x higher
dimensional embeddings.
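To make the two-step recipe concrete, here is a minimal, heavily stubbed sketch of the pipeline the abstract describes. All names (`llm`, `embed`, `top_k`, `generate_pair`, `relabel`) and prompts are hypothetical illustrations, not the paper's actual code or prompt templates; the LLM and embedder are replaced with stubs so the example runs as-is.

```python
# Minimal sketch of the two-step LLM distillation described in the abstract.
# Everything here is an assumption for illustration: `llm`, `embed`, and the
# corpus are hypothetical stubs, not the paper's actual models or API.

from typing import List, Tuple


def llm(prompt: str) -> str:
    """Stand-in for a large language model call; stubbed so the file runs."""
    return "stubbed LLM output"


def embed(text: str) -> List[float]:
    """Stand-in for the embedding model being distilled into (the retriever)."""
    return [float(len(text))]  # placeholder one-dimensional "embedding"


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: str, corpus: List[str], k: int) -> List[str]:
    """Retrieve k candidate passages for a query with the current embedder."""
    q = embed(query)
    return sorted(corpus, key=lambda p: -dot(q, embed(p)))[:k]


# Step 1: use the LLM to generate diverse, synthetic (task, query) pairs
# conditioned on seed passages.
def generate_pair(passage: str) -> Tuple[str, str]:
    task = llm(f"Describe a retrieval task this passage could answer:\n{passage}")
    query = llm(f"Write a query for the task '{task}' answered by:\n{passage}")
    return task, query


# Step 2: retrieve candidate passages for each synthetic query, then have the
# *same* LLM relabel which candidate is the positive and which is a hard
# negative, rather than trusting the seed passage blindly.
def relabel(query: str, candidates: List[str]) -> Tuple[str, str]:
    ranking = llm(
        f"Rank these passages by relevance to the query '{query}':\n"
        + "\n".join(candidates)
    )
    # In practice the LLM's ranking would be parsed; here we pick naively.
    positive, hard_negative = candidates[0], candidates[-1]
    return positive, hard_negative


if __name__ == "__main__":
    corpus = ["passage A ...", "passage B ...", "passage C ..."]
    _task, query = generate_pair(corpus[0])
    candidates = top_k(query, corpus, k=3)
    positive, hard_negative = relabel(query, candidates)
    # The resulting (query, positive, hard_negative) triples are the training
    # data for the compact embedding model.
```

Note that under this scheme the relabeled positive need not be the seed passage the query was generated from; per the abstract, this LLM-based relabeling of positives and hard negatives is what refines the quality of the synthetic data.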