Qwen3-Omni Technical Report

Abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tune Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
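To illustrate why a causal ConvNet permits streaming from the first codec frame, the following is a minimal sketch of a single causal 1-D convolution in plain Python. It is not the model's actual waveform synthesis module: the names `causal_conv1d` and `StreamingCausalConv`, the scalar frames, and the kernel values are all hypothetical and chosen only for illustration. The point it shows is that each output depends only on current and past frames, so processing frames one at a time gives the same result as processing the whole sequence at once.

```python
from typing import List


def causal_conv1d(frames: List[float], kernel: List[float]) -> List[float]:
    """Offline causal convolution: output[t] uses only frames[t-k+1 .. t]."""
    k = len(kernel)
    # Pad on the left only, so no output ever looks at a future frame.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


class StreamingCausalConv:
    """Same operation, one frame at a time, keeping the last k-1 frames as state."""

    def __init__(self, kernel: List[float]) -> None:
        self.kernel = list(kernel)
        self.history = [0.0] * (len(kernel) - 1)  # plays the role of the left padding

    def push(self, frame: float) -> float:
        window = self.history + [frame]
        out = sum(w * x for w, x in zip(self.kernel, window))
        self.history = window[1:]  # drop the oldest frame, keep the newest k-1
        return out


# Hypothetical toy input: four scalar "codec frames" and a length-3 kernel.
frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]

offline = causal_conv1d(frames, kernel)
stream = StreamingCausalConv(kernel)
online = [stream.push(f) for f in frames]

# An output is available as soon as the first frame arrives,
# and the streamed outputs match the offline ones.
assert online == offline
```

A block-wise model, by contrast, has to wait until a full block of frames is available before emitting anything, which is the source of the additional first-packet delay the abstract refers to.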
