Improving Text Embeddings with Large Language Models
December 31, 2023
Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
cs.AI
Abstract
In this paper, we introduce a novel and simple method for obtaining
high-quality text embeddings using only synthetic data and less than 1k
training steps. Unlike existing methods that often depend on multi-stage
intermediate pre-training with billions of weakly-supervised text pairs,
followed by fine-tuning with a few labeled datasets, our method does not
require building complex training pipelines or relying on manually collected
datasets that are often constrained by task diversity and language coverage. We
leverage proprietary LLMs to generate diverse synthetic data for hundreds of
thousands of text embedding tasks across nearly 100 languages. We then
fine-tune open-source decoder-only LLMs on the synthetic data using standard
contrastive loss. Experiments demonstrate that our method achieves strong
performance on highly competitive text embedding benchmarks without using any
labeled data. Furthermore, when fine-tuned with a mixture of synthetic and
labeled data, our model sets new state-of-the-art results on the BEIR and MTEB
benchmarks.
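
The "standard contrastive loss" referred to above is, in the usual text-embedding setup, an InfoNCE objective over (query, positive passage) pairs with in-batch negatives. The following is a minimal sketch of that objective, assuming a PyTorch setup; the function name, batching scheme, and temperature value are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """Contrastive (InfoNCE) loss with in-batch negatives.

    query_emb, passage_emb: [batch, dim] embeddings produced by the model;
    row i of passage_emb is the positive for row i of query_emb, and every
    other row in the batch serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Cosine-similarity matrix scaled by temperature: [batch, batch]
    logits = q @ p.T / temperature
    # The matching passage for query i sits on the diagonal (index i)
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```

In this setup, the embeddings would come from a decoder-only LLM (e.g. by pooling a final hidden state), and the loss pushes each query toward its paired passage while pushing it away from the other passages in the batch.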