Qwen3-VL Technical Report
November 26, 2025
Authors: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
cs.AI
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (MoE; 30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL is built around three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved MRoPE for stronger spatio-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and MoE architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
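The DeepStack upgrade summarized above feeds visual features from several ViT depths into the language model, rather than only the final-layer features, to tighten vision-language alignment. Below is a minimal, illustrative sketch of that idea in PyTorch; all module names, dimensions, and the injection rule (adding each ViT level onto the visual token positions before a successive LM layer) are assumptions made for this example and do not reflect Qwen3-VL's actual implementation.

```python
# Illustrative sketch only: a toy "DeepStack-style" injection of multi-level
# ViT features into early LM layers. Shapes, names, and the injection rule are
# assumptions for this example, not Qwen3-VL's implementation.
import torch
import torch.nn as nn


class ToyViT(nn.Module):
    """Tiny vision encoder that keeps the hidden states of every block."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(32, dim)  # toy patch features -> model dim
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)]
        )

    def forward(self, patches):
        x = self.patch_embed(patches)
        levels = []
        for blk in self.blocks:
            x = blk(x)
            levels.append(x)  # one feature map per ViT depth
        return levels


class ToyDeepStackLM(nn.Module):
    """Tiny LM whose early layers each receive one level of ViT features."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)]
        )
        # One projection per injected visual level (an assumed design choice).
        self.visual_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, embeds, vit_levels, num_visual_tokens):
        x = embeds
        for i, layer in enumerate(self.layers):
            if i < len(vit_levels):
                # Add the i-th ViT level onto the visual token positions, so
                # successive LM layers see progressively deeper visual features.
                v = self.visual_proj[i](vit_levels[i])
                x = torch.cat([x[:, :num_visual_tokens] + v, x[:, num_visual_tokens:]], dim=1)
            x = layer(x)
        return x


if __name__ == "__main__":
    vit, lm = ToyViT(), ToyDeepStackLM()
    patches = torch.randn(1, 16, 32)       # 16 toy image patches
    embeds = torch.randn(1, 16 + 8, 64)    # 16 visual slots followed by 8 text tokens
    out = lm(embeds, vit(patches), num_visual_tokens=16)
    print(out.shape)                        # torch.Size([1, 24, 64])
```

The design choice sketched here (one projection per ViT level, added to the visual token slots) is just one way multi-level features could be fused; the report's actual fusion mechanism may differ.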