Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. Leveraging the multimodal capabilities of Gemini, the model produces embeddings for arbitrary interleaved combinations of these modalities, and these embeddings generalize well across a wide variety of tasks. By applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks covering unimodal, cross-modal, and multimodal retrieval. The model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation model even for specialized domains.
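The abstract reports retrieval quality as R@1 and NDCG@10. As a rough illustration of what those numbers measure, the sketch below ranks a toy corpus by cosine similarity in a shared vector space and scores the ranking. The vectors are hand-written stand-ins, not outputs of Gemini Embedding 2, and the function names (`rank`, `recall_at_k`, `ndcg_at_k`) are hypothetical. The metric definitions are the commonly used ones and may differ in detail from the paper's evaluation protocol.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, corpus):
    """Corpus indices sorted by descending cosine similarity to the query."""
    return sorted(range(len(corpus)),
                  key=lambda i: cosine(query, corpus[i]),
                  reverse=True)


def recall_at_k(ranking, relevant, k):
    """Per-query hit rate: 1.0 if any relevant index is in the top k, else 0.0.

    Averaging this over all queries gives the usual R@k figure.
    """
    return 1.0 if any(i in relevant for i in ranking[:k]) else 0.0


def ndcg_at_k(ranking, gains, k):
    """NDCG@k with a log2 position discount.

    `gains` maps a corpus index to its relevance grade; missing indices count as 0.
    """
    dcg = sum(gains.get(i, 0.0) / math.log2(pos + 2)
              for pos, i in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy stand-ins: in a real setup the query might be a text embedding and the
# corpus entries image or video embeddings from the same model (assumption).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]

ranking = rank(query, corpus)
hit = recall_at_k(ranking, relevant={0}, k=1)
ndcg = ndcg_at_k(ranking, gains={2: 1.0}, k=10)
print(ranking, hit, round(ndcg, 4))
```

Because all modalities share one representation space, the same ranking and scoring code applies whether the query and corpus items are text, images, audio, or video; only the embedding step changes.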