
Qwen3-Omni Technical Report

September 22, 2025
Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art (SOTA) performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codec tokens using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tune Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
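The abstract attributes the low first-packet latency to replacing block-wise diffusion with a lightweight causal ConvNet over multi-codebook codec tokens. The snippet below is a minimal sketch of that idea, not the released implementation: all class names, layer sizes, codebook counts, and samples-per-frame values are illustrative assumptions. It only shows why strictly causal (left-padded) convolutions allow audio to be emitted from the very first codec frame, instead of waiting for a full diffusion block.

```python
# Minimal sketch (assumption, not Qwen3-Omni's actual vocoder): a causal 1-D
# ConvNet that maps multi-codebook codec-frame codes to waveform samples.
# Every convolution sees only past frames, so the audio for frame t can be
# produced as soon as frame t's codes arrive.
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d padded only on the left, so output[t] depends on input[<=t]."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                       # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return super().forward(x)

class StreamingCodecVocoder(nn.Module):
    """Toy codec-to-waveform head: embed codes from each codebook, sum them,
    then run a stack of causal convs that emits samples per codec frame."""
    def __init__(self, num_codebooks=8, codebook_size=1024,
                 dim=256, samples_per_frame=480):   # e.g. 20 ms at 24 kHz (assumed)
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks))
        self.net = nn.Sequential(
            CausalConv1d(dim, dim, kernel_size=3), nn.GELU(),
            CausalConv1d(dim, dim, kernel_size=3, dilation=2), nn.GELU(),
            CausalConv1d(dim, samples_per_frame, kernel_size=1),
        )

    def forward(self, codes):                   # codes: (batch, codebooks, frames)
        h = sum(emb(codes[:, i]) for i, emb in enumerate(self.embeds))  # (B, T, D)
        wav = self.net(h.transpose(1, 2))       # (B, samples_per_frame, T)
        return wav.transpose(1, 2).reshape(codes.size(0), -1)          # (B, T * spf)

# Streaming usage: feed one codec frame at a time and emit its audio chunk
# immediately. A real streaming implementation would also cache each
# convolution's left context across calls instead of re-padding with zeros.
vocoder = StreamingCodecVocoder()
frame = torch.randint(0, 1024, (1, 8, 1))       # one frame of 8-codebook codes
chunk = vocoder(frame)                          # (1, 480) samples for that frame
```

Because nothing in this path requires future frames or iterative denoising over a block, the first audio packet can be assembled as soon as the Talker emits its first codec frame, which is what makes a cold-start first-packet latency on the order of a few hundred milliseconds plausible.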