
Qwen3-Omni Technical Report

September 22, 2025
Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art (SOTA) performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codec tokens using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tune Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
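The abstract attributes the low first-packet latency to replacing block-wise diffusion with a lightweight causal ConvNet over multi-codebook codec tokens. The snippet below is a minimal sketch of that idea, not the released implementation: all class names, layer sizes, codebook counts, and samples-per-frame values are illustrative assumptions. It only shows why strictly causal (left-padded) convolutions allow audio to be emitted from the very first codec frame, instead of waiting for a full diffusion block.

```python
# Minimal sketch (assumption, not Qwen3-Omni's actual vocoder): a causal 1-D
# ConvNet that maps multi-codebook codec-frame codes to waveform samples.
# Every convolution sees only past frames, so the audio for frame t can be
# produced as soon as frame t's codes arrive.
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d padded only on the left, so output[t] depends on input[<=t]."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                       # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return super().forward(x)

class StreamingCodecVocoder(nn.Module):
    """Toy codec-to-waveform head: embed codes from each codebook, sum them,
    then run a stack of causal convs that emits samples per codec frame."""
    def __init__(self, num_codebooks=8, codebook_size=1024,
                 dim=256, samples_per_frame=480):   # e.g. 20 ms at 24 kHz (assumed)
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks))
        self.net = nn.Sequential(
            CausalConv1d(dim, dim, kernel_size=3), nn.GELU(),
            CausalConv1d(dim, dim, kernel_size=3, dilation=2), nn.GELU(),
            CausalConv1d(dim, samples_per_frame, kernel_size=1),
        )

    def forward(self, codes):                   # codes: (batch, codebooks, frames)
        h = sum(emb(codes[:, i]) for i, emb in enumerate(self.embeds))  # (B, T, D)
        wav = self.net(h.transpose(1, 2))       # (B, samples_per_frame, T)
        return wav.transpose(1, 2).reshape(codes.size(0), -1)          # (B, T * spf)

# Streaming usage: feed one codec frame at a time and emit its audio chunk
# immediately. A real streaming implementation would also cache each
# convolution's left context across calls instead of re-padding with zeros.
vocoder = StreamingCodecVocoder()
frame = torch.randint(0, 1024, (1, 8, 1))       # one frame of 8-codebook codes
chunk = vocoder(frame)                          # (1, 480) samples for that frame
```

Because nothing in this path requires future frames or iterative denoising over a block, the first audio packet can be assembled as soon as the Talker emits its first codec frame, which is what makes a cold-start first-packet latency on the order of a few hundred milliseconds plausible.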