Qwen3-Omni Technical Report

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
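The streaming claim rests on a general property of causal convolutions: each output depends only on the current and past inputs, so the first output can be emitted as soon as the first frame arrives. The toy sketch below illustrates that property only. It is not the Qwen3-Omni waveform decoder; the names `causal_conv1d` and `StreamingCausalConv1d`, the scalar frames, and the kernel values are all hypothetical choices made for illustration.

```python
from collections import deque


def causal_conv1d(x, kernel):
    """Full-sequence causal convolution (illustrative, single channel).

    y[t] = sum_k kernel[k] * x[t - k], with x treated as zero for t - k < 0.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        y.append(acc)
    return y


class StreamingCausalConv1d:
    """Same filter, but fed one frame at a time (hypothetical helper)."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Most recent inputs, newest first; zeros act as left padding.
        self.history = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def push(self, frame):
        # appendleft on a full deque drops the oldest input from the right.
        self.history.appendleft(frame)
        return sum(w * h for w, h in zip(self.kernel, self.history))


# Assumed toy data: four scalar "frames" and a 3-tap kernel.
frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]

full = causal_conv1d(frames, kernel)

stream = StreamingCausalConv1d(kernel)
streamed = [stream.push(f) for f in frames]
```

Because no output looks ahead, `streamed` matches `full` element by element, and the first value is available after a single frame. A non-causal (centered) filter would have to wait for future frames before producing its first output, which is the kind of delay a causal design avoids.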
