Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
January 8, 2026
Authors: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
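The Matryoshka Representation Learning support mentioned above means an embedding can be truncated to a shorter leading prefix and re-normalized, trading accuracy for storage and compute without retraining. A minimal sketch of that mechanic, using synthetic unit vectors rather than real Qwen3-VL-Embedding outputs (the `truncate_embedding` helper is a hypothetical illustration, not part of the released API):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    sub = vec[:dim]
    return sub / np.linalg.norm(sub)

rng = np.random.default_rng(0)

# Stand-in "document" embedding and a slightly perturbed "query" embedding.
doc = rng.normal(size=4096)
doc /= np.linalg.norm(doc)
noise = rng.normal(size=4096)
query = doc + 0.3 * noise / np.linalg.norm(noise)
query /= np.linalg.norm(query)

# Cosine similarity degrades only mildly as the dimension shrinks, which is
# why truncated prefixes remain usable for retrieval.
for dim in (4096, 1024, 256):
    q, d = truncate_embedding(query, dim), truncate_embedding(doc, dim)
    print(dim, round(float(q @ d), 3))
```

In a real deployment the truncated vectors would be indexed at the reduced dimension, with the cross-encoder reranker applied only to the top candidates retrieved this way.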