Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
January 8, 2026
Authors: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs of up to 32K tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
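As a rough illustration of how the flexible (Matryoshka) embedding dimensions described in the abstract are typically used in an embed-then-rerank pipeline, the following Python sketch truncates unit-normalized embeddings to a shorter prefix, re-normalizes them, and ranks a corpus by cosine similarity. The 4096-dimensional full size, the helper names (fake_embed, truncate), and the random vectors are illustrative assumptions, not the actual output or API of the Qwen3-VL-Embedding model; a cross-encoder such as Qwen3-VL-Reranker would then rescore only the retrieved top-k query-document pairs.

```python
import numpy as np

# Minimal sketch of Matryoshka-style truncation, assuming the embedding
# model emits unit-normalized vectors. The 4096-d size is an assumption
# chosen for illustration; random vectors stand in for real embeddings.
rng = np.random.default_rng(0)

def fake_embed(n: int, dim: int = 4096) -> np.ndarray:
    """Stand-in for the embedding model: returns L2-normalized vectors."""
    v = rng.standard_normal((n, dim))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def truncate(emb: np.ndarray, d: int) -> np.ndarray:
    """Keep the first d dimensions and re-normalize (Matryoshka usage)."""
    e = emb[..., :d]
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

query = fake_embed(1)        # e.g. a text query embedding
corpus = fake_embed(1000)    # e.g. image / document-image / video embeddings

# First-stage retrieval with a cheaper 256-d prefix; a cross-encoder
# reranker (not sketched here) would rescore the top-k candidates.
q256, c256 = truncate(query, 256), truncate(corpus, 256)
scores = (q256 @ c256.T).ravel()   # cosine similarity on unit vectors
topk = np.argsort(-scores)[:10]
print("top-10 candidate indices:", topk)
```

Truncating to a prefix and re-normalizing is the standard way Matryoshka-trained embeddings trade accuracy for index size and latency; the full-dimensional vectors remain available when maximum retrieval quality is needed.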