Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
June 5, 2025
Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
cs.AI
Abstract
In this work, we introduce the Qwen3 Embedding series, a significant
advancement over its predecessor, the GTE-Qwen series, in text embedding and
reranking capabilities, built upon the Qwen3 foundation models. Leveraging the
Qwen3 LLMs' robust capabilities in multilingual text understanding and
generation, our innovative multi-stage training pipeline combines large-scale
unsupervised pre-training with supervised fine-tuning on high-quality datasets.
Effective model merging strategies further ensure the robustness and
adaptability of the Qwen3 Embedding series. During the training process, the
Qwen3 LLMs serve not only as backbone models but also play a crucial role in
synthesizing high-quality, rich, and diverse training data across multiple
domains and languages, thus enhancing the training pipeline. The Qwen3
Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both
embedding and reranking tasks, addressing diverse deployment scenarios where
users can optimize for either efficiency or effectiveness. Empirical
evaluations demonstrate that the Qwen3 Embedding series achieves
state-of-the-art results across diverse benchmarks. Notably, it excels on the
multilingual evaluation benchmark MTEB for text embedding, as well as in
various retrieval tasks, including code retrieval, cross-lingual retrieval and
multilingual retrieval. To facilitate reproducibility and promote
community-driven research and development, the Qwen3 Embedding models are
publicly available under the Apache 2.0 license.
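
Embedding and reranking models are often combined in a two-stage pipeline: an embedding model retrieves a short candidate list by vector similarity, and a reranker rescores each query-document pair. The abstract does not prescribe this pipeline, so the following is only a minimal sketch of that common pattern. The vectors are hand-made toy values, and `token_overlap` is a hypothetical stand-in for a real reranker; none of the names below come from the Qwen3 Embedding release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, doc_vecs, k):
    """Stage 1: rank document ids by cosine similarity and keep the top k."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rerank(query, candidates, texts, score_pair):
    """Stage 2: reorder candidates with a pairwise (query, document) scorer."""
    return sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)


def token_overlap(query, doc):
    """Hypothetical scorer: count shared lowercase tokens (not a real reranker)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Toy corpus with hand-made 2-d "embeddings" (assumed values, for illustration only).
texts = {
    "a": "python code search",
    "b": "cross-lingual retrieval of documents",
    "c": "cooking recipes",
}
doc_vecs = {"a": [1.0, 0.1], "b": [0.8, 0.6], "c": [0.0, 1.0]}

query = "retrieval of documents across languages"
query_vec = [1.0, 0.3]

candidates = retrieve(query_vec, doc_vecs, k=2)
final = rerank(query, candidates, texts, token_overlap)
print(candidates, final)
```

In this toy setup the first stage keeps documents `a` and `b`, and the second stage moves `b` ahead of `a` because it shares more tokens with the query. In practice the first stage would use vectors from an embedding model and the second a learned reranker.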