Qwen3-Omni Technical Report
September 22, 2025
Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
We present Qwen3-Omni, a single multimodal model that, for the first time,
maintains state-of-the-art performance across text, image, audio, and video
without any degradation relative to single-modal counterparts. Qwen3-Omni
matches the performance of same-sized single-modal models within the Qwen
series and excels particularly on audio tasks. Across 36 audio and audio-visual
benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall
SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro,
Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE
architecture that unifies perception and generation across text, images, audio,
and video, yielding fluent text and natural real-time speech. It supports text
interaction in 119 languages, speech understanding in 19 languages, and speech
generation in 10 languages. To reduce first-packet latency in streaming
synthesis, Talker autoregressively predicts discrete speech codecs using a
multi-codebook scheme. Leveraging the representational capacity of these
codebooks, we replace computationally intensive block-wise diffusion with a
lightweight causal ConvNet, enabling streaming from the first codec frame. In
cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet
latency of 234 ms. To further strengthen multimodal reasoning, we introduce a
Thinking model that explicitly reasons over inputs from any modality. Since the
research community currently lacks a general-purpose audio captioning model, we
fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which
produces detailed, low-hallucination captions for arbitrary audio inputs.
Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and
Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0
license.
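
To make the streaming claim in the abstract concrete, the sketch below illustrates why a lightweight causal ConvNet decoder allows audio to start from the first codec frame: each multi-codebook frame emitted autoregressively by the Talker can be converted to waveform samples immediately, with no diffusion block to wait for. This is a minimal illustration, not the released implementation; the toy Talker, module sizes, codebook count, and samples-per-frame value are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn


class CausalConvDecoder(nn.Module):
    """Causal 1-D ConvNet: each output sample depends only on current and past codec frames."""

    def __init__(self, num_codebooks=4, codebook_size=1024, dim=256,
                 kernel_size=5, samples_per_frame=320):
        super().__init__()
        # One embedding table per codebook; a frame's embedding is their sum.
        self.embed = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )
        self.left_pad = kernel_size - 1  # pad only on the left, which keeps the stack causal
        self.convs = nn.ModuleList(nn.Conv1d(dim, dim, kernel_size) for _ in range(3))
        self.to_wave = nn.Linear(dim, samples_per_frame)  # one codec frame -> a chunk of samples

    def forward(self, codes):                      # codes: (batch, frames, num_codebooks)
        x = sum(emb(codes[..., i]) for i, emb in enumerate(self.embed))  # (B, T, D)
        x = x.transpose(1, 2)                      # (B, D, T) for Conv1d
        for conv in self.convs:
            x = torch.relu(conv(nn.functional.pad(x, (self.left_pad, 0))))
        return self.to_wave(x.transpose(1, 2)).flatten(1)  # (B, T * samples_per_frame)


def toy_talker_step(step, num_codebooks=4, codebook_size=1024):
    """Stand-in for the autoregressive Talker: emits one random multi-codebook codec frame."""
    torch.manual_seed(step)
    return torch.randint(0, codebook_size, (1, 1, num_codebooks))


decoder = CausalConvDecoder()
frames = []
for step in range(5):                              # streaming loop: one codec frame per step
    frames.append(toy_talker_step(step))
    wave = decoder(torch.cat(frames, dim=1))
    # Causality guarantees earlier samples are unchanged, so only the newest
    # frame's samples need to be shipped; audio can begin at the very first frame.
    new_samples = wave[:, -decoder.to_wave.out_features:]
    print(f"frame {step}: emit {new_samples.shape[-1]} samples")
```

In the actual system the Talker is an MoE transformer conditioned on the Thinker's representations rather than the random stand-in used here; the sketch only demonstrates the frame-by-frame decoding pattern that underlies the reported first-packet latency.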