Qwen3-Omni Technical Report
September 22, 2025
Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
We present Qwen3-Omni, a single multimodal model that, for the first time,
maintains state-of-the-art performance across text, image, audio, and video
without any degradation relative to single-modal counterparts. Qwen3-Omni
matches the performance of same-sized single-modal models within the Qwen
series and excels particularly on audio tasks. Across 36 audio and audio-visual
benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall
SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro,
Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE
architecture that unifies perception and generation across text, images, audio,
and video, yielding fluent text and natural real-time speech. It supports text
interaction in 119 languages, speech understanding in 19 languages, and speech
generation in 10 languages. To reduce first-packet latency in streaming
synthesis, Talker autoregressively predicts discrete speech codecs using a
multi-codebook scheme. Leveraging the representational capacity of these
codebooks, we replace computationally intensive block-wise diffusion with a
lightweight causal ConvNet, enabling streaming from the first codec frame. In
cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet
latency of 234 ms. To further strengthen multimodal reasoning, we introduce a
Thinking model that explicitly reasons over inputs from any modality. Since the
research community currently lacks a general-purpose audio captioning model, we
fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which
produces detailed, low-hallucination captions for arbitrary audio inputs.
Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and
Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0
license.
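
To make the streaming claim in the abstract concrete, the sketch below illustrates why a lightweight causal ConvNet decoder allows audio to start from the first codec frame: each multi-codebook frame emitted autoregressively by the Talker can be converted to waveform samples immediately, with no diffusion block to wait for. This is a minimal illustration, not the released implementation; the toy Talker, module sizes, codebook count, and samples-per-frame value are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn


class CausalConvDecoder(nn.Module):
    """Causal 1-D ConvNet: each output sample depends only on current and past codec frames."""

    def __init__(self, num_codebooks=4, codebook_size=1024, dim=256,
                 kernel_size=5, samples_per_frame=320):
        super().__init__()
        # One embedding table per codebook; a frame's embedding is their sum.
        self.embed = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )
        self.left_pad = kernel_size - 1  # pad only on the left, which keeps the stack causal
        self.convs = nn.ModuleList(nn.Conv1d(dim, dim, kernel_size) for _ in range(3))
        self.to_wave = nn.Linear(dim, samples_per_frame)  # one codec frame -> a chunk of samples

    def forward(self, codes):                      # codes: (batch, frames, num_codebooks)
        x = sum(emb(codes[..., i]) for i, emb in enumerate(self.embed))  # (B, T, D)
        x = x.transpose(1, 2)                      # (B, D, T) for Conv1d
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.left_pad, 0))))
        return self.to_wave(x.transpose(1, 2)).flatten(1)  # (B, T * samples_per_frame)


def toy_talker_step(step, num_codebooks=4, codebook_size=1024):
    """Stand-in for the autoregressive Talker: emits one random multi-codebook codec frame."""
    torch.manual_seed(step)
    return torch.randint(0, codebook_size, (1, 1, num_codebooks))


decoder = CausalConvDecoder()
frames = []
for step in range(5):                              # streaming loop: one codec frame per step
    frames.append(toy_talker_step(step))
    wave = decoder(torch.cat(frames, dim=1))
    # Causality guarantees earlier samples are unchanged, so only the newest
    # frame's samples need to be shipped; audio can begin at the very first frame.
    new_samples = wave[:, -decoder.to_wave.out_features:]
    print(f"frame {step}: emit {new_samples.shape[-1]} samples")
```

In the actual system the Talker is an MoE transformer conditioned on the Thinker's representations rather than the random stand-in used here; the sketch only demonstrates the frame-by-frame decoding pattern that underlies the reported first-packet latency.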