Improving Text Embeddings with Large Language Models

Abstract

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
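
The abstract names only a "standard contrastive loss" and gives no formula. As a rough illustration, the sketch below implements one common variant, InfoNCE with in-batch negatives over cosine similarities. The function names, the temperature value of 0.05, and the toy 2-D vectors are assumptions chosen for this example and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative sketch).

    queries[i] and documents[i] form a positive pair; every other
    document in the batch serves as a negative for query i.
    The temperature default is an assumed value, not from the paper.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Numerically stable log-sum-exp over all candidate documents.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive document.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (made-up values, for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is close to its own document
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is close to the wrong document

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

In this toy setup the loss is small when each query is most similar to its paired document and large when the pairs are swapped, which is the signal that fine-tuning would minimize. In practice the embeddings would come from the fine-tuned decoder-only LLM and the loss would be computed with an automatic differentiation framework rather than plain Python lists.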
