Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
June 5, 2025
Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
cs.AI
Abstract
In this work, we introduce the Qwen3 Embedding series, which is built upon the
Qwen3 foundation models and significantly advances text embedding and
reranking capabilities over its predecessor, the GTE-Qwen series. Leveraging the
Qwen3 LLMs' robust capabilities in multilingual text understanding and
generation, our innovative multi-stage training pipeline combines large-scale
unsupervised pre-training with supervised fine-tuning on high-quality datasets.
Effective model merging strategies further ensure the robustness and
adaptability of the Qwen3 Embedding series. During the training process, the
Qwen3 LLMs not only serve as backbone models but also play a crucial role in
synthesizing high-quality, rich, and diverse training data across multiple
domains and languages, thus enhancing the training pipeline. The Qwen3
Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both
embedding and reranking tasks, addressing diverse deployment scenarios where
users can optimize for either efficiency or effectiveness. Empirical
evaluations demonstrate that the Qwen3 Embedding series achieves
state-of-the-art results across diverse benchmarks. Notably, it excels on the
multilingual evaluation benchmark MTEB for text embedding, as well as in
various retrieval tasks, including code retrieval, cross-lingual retrieval and
multilingual retrieval. To facilitate reproducibility and promote
community-driven research and development, the Qwen3 Embedding models are
publicly available under the Apache 2.0 license.
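
The abstract mentions both embedding and reranking models. One common way to combine the two is a retrieve-then-rerank pipeline, sketched below under stated assumptions. The abstract does not describe the models' API, so nothing here calls the released models: the hand-written vectors stand in for embedding-model outputs, and toy_overlap_score is a hypothetical placeholder for a reranking model. All function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank documents by embedding similarity, keep the top_k indices."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:top_k]

def rerank(query, docs, candidates, score_fn):
    """Stage 2: re-score each (query, document) pair with a costlier scorer."""
    return sorted(candidates,
                  key=lambda i: score_fn(query, docs[i]),
                  reverse=True)

def toy_overlap_score(query, doc):
    """Hypothetical stand-in for a reranker: fraction of query tokens in doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0

# Toy corpus; the vectors are made up, not produced by any real model.
docs = [
    "python code retrieval example",
    "cross-lingual retrieval of news",
    "recipe for bread",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.0, 0.1, 0.9]]
query = "cross-lingual retrieval"
query_vec = [0.8, 0.3, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_k=2)
reranked = rerank(query, docs, candidates, toy_overlap_score)
print(candidates, reranked)
```

In this sketch the cheap similarity search narrows the corpus to a short candidate list, and the pairwise scorer only runs on those candidates. That split reflects the efficiency versus effectiveness trade-off the abstract refers to, although the actual models and their interfaces may differ.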