Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Abstract

In this work, we introduce the Qwen3 Embedding series, which is built on the Qwen3 foundation models and significantly improves on its predecessor, the GTE-Qwen series, in text embedding and reranking. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During training, the Qwen3 LLMs serve not only as backbone models but also as generators of high-quality, rich, and diverse synthetic training data across multiple domains and languages, which strengthens the training pipeline. The Qwen3 Embedding series offers three model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, covering deployment scenarios in which users optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval, and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
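The abstract distinguishes embedding models from reranking models without showing how the two are typically combined. The following is a minimal sketch of a generic two-stage retrieval flow, not the authors' implementation: candidates are first ranked by cosine similarity between embedding vectors, then reordered by a separate relevance score. The vectors, document ids, and reranker scores are hand-written placeholders (assumptions for illustration); in practice they would come from an embedding model and a reranking model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, top_k=2):
    """Stage 1: rank documents by embedding similarity and keep the top_k ids."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


def rerank(candidate_ids, relevance_scores):
    """Stage 2: reorder candidates by a (hypothetical) reranker relevance score."""
    return sorted(candidate_ids, key=lambda d: relevance_scores[d], reverse=True)


# Placeholder embeddings; a real system would obtain these from an embedding model.
DOC_VECS = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.3, 0.1],
}
QUERY_VEC = [1.0, 0.0, 0.0]

# Placeholder reranker outputs for the (query, document) pairs.
RERANKER_SCORES = {"doc_a": 0.2, "doc_b": 0.1, "doc_c": 0.8}

candidates = retrieve(QUERY_VEC, DOC_VECS, top_k=2)
final_order = rerank(candidates, RERANKER_SCORES)
print(candidates)
print(final_order)
```

The split reflects the efficiency and effectiveness trade-off mentioned above: similarity search over precomputed vectors is cheap and narrows the candidate set, while the reranker scores each query and document pair and is applied only to the shortlist.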