Qwen3-VL Technical Report
November 26, 2025
Authors: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
cs.AI
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and Mixture-of-Experts (MoE; 30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL rests on three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, with leading performance on comprehensive evaluations such as MMMU and on visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatio-temporal modeling across images and video; (ii) DeepStack integration, which leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and MoE architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
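The third architectural upgrade replaces implicit rotary time encoding (T-RoPE) with explicit textual timestamps attached to sampled video frames. The following is a minimal conceptual sketch of that idea, not the report's actual implementation: the placeholder token name, the timestamp string format, and the 2-fps sampling rate are all illustrative assumptions, intended only to show how a video prompt might interleave human-readable timestamps with per-frame visual tokens.

```python
# Conceptual sketch only (assumptions, not the Qwen3-VL implementation):
# each sampled video frame is preceded by an explicit textual timestamp,
# so temporal grounding can be expressed in plain text rather than solely
# via rotary position (T-RoPE) offsets.

FRAME_PLACEHOLDER = "<|vision_placeholder|>"  # hypothetical stand-in for one frame's visual tokens


def build_timestamped_video_prompt(num_frames: int, fps: float = 2.0) -> str:
    """Interleave human-readable timestamps with per-frame vision placeholders."""
    pieces = []
    for i in range(num_frames):
        t = i / fps  # wall-clock time of this sampled frame, in seconds
        pieces.append(f"<{t:.1f} seconds>")  # explicit textual timestamp
        pieces.append(FRAME_PLACEHOLDER)     # the frame's visual tokens would go here
    return "".join(pieces)


if __name__ == "__main__":
    # Six frames sampled at 2 fps -> timestamps 0.0 s, 0.5 s, 1.0 s, ...
    print(build_timestamped_video_prompt(num_frames=6))
```

Because the timestamps live in the text stream, a model trained on such prompts can answer "when" questions by emitting the same timestamp format it saw at input time, which is the intuition behind the more precise temporal grounding claimed in the abstract.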