Qwen3-Omni Technical Report

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tune Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
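As a rough illustration of why a causal convolution permits streaming from the first frame, the toy sketch below compares an offline causal 1-D convolution with a frame-by-frame streaming version. It is not the report's implementation: the function names, the scalar "frames", and the kernel values are assumptions chosen for illustration, whereas the actual model operates on multi-codebook codec frames with learned weights. The point is only that each output depends on the current and past inputs, so the first output is available as soon as the first input arrives.

```python
from collections import deque
from typing import Iterable, Iterator, List


def causal_conv_full(frames: List[float], kernel: List[float]) -> List[float]:
    """Offline causal 1-D convolution (hypothetical helper).

    y[t] = sum_k kernel[k] * x[t - k], treating x[t] as 0 for t < 0.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * frames[t - k]
        out.append(acc)
    return out


def causal_conv_stream(frames: Iterable[float], kernel: List[float]) -> Iterator[float]:
    """Streaming causal convolution (hypothetical helper).

    Emits y[t] as soon as x[t] arrives, buffering only len(kernel) past frames.
    """
    # history[k] holds x[t - k]; it starts as zero padding for t < 0.
    history = deque([0.0] * len(kernel), maxlen=len(kernel))
    for x in frames:
        history.appendleft(x)  # the oldest frame falls off the right end
        yield sum(w * h for w, h in zip(kernel, history))


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]   # stand-in for codec frames (assumption)
    kernel = [0.5, 0.25, 0.25]      # arbitrary illustrative weights
    print(causal_conv_full(frames, kernel))
    print(list(causal_conv_stream(frames, kernel)))
```

Because no output depends on future frames, the streaming loop reproduces the offline result while never waiting for a full block, which is the property that a block-wise method lacks.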
