Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Abstract

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
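Embedding and reranking models are commonly combined in a two-stage retrieval setup: the embedding model shortlists candidates by vector similarity, and the reranker rescores that shortlist. The sketch below illustrates this flow only. It is not the Qwen3 Embedding API or the authors' pipeline. The function names (`cosine`, `retrieve`, `rerank`), the 3-dimensional vectors, and the relevance scores are all hypothetical stand-ins for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank documents by embedding similarity, keep the top_k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(candidates, relevance):
    """Stage 2: reorder the shortlist by a (hypothetical) reranker score."""
    return sorted(candidates, key=lambda pair: relevance[pair[0]], reverse=True)


# Toy vectors standing in for embedding-model output (assumed values).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_k=2)

# Made-up scores standing in for a reranking model's query-document relevance.
relevance = {"doc_a": 0.40, "doc_b": 0.85, "doc_c": 0.05}
final = rerank(candidates, relevance)

print([doc_id for doc_id, _ in final])
```

In this toy setup the first stage is cheap enough to run over every document, while the second stage only scores the shortlist. This is one way the efficiency versus effectiveness trade-off mentioned in the abstract can show up in practice.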
