Qwen3.5-Omni Technical Report

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256K-token context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). Streaming speech synthesis is often unstable and unnatural because text and speech tokenizers differ in encoding efficiency; to address this, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
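
The input limits quoted in the abstract can be put in rough proportion with a little arithmetic. The sketch below is only a back-of-envelope illustration, not a description of how the model tokenizes anything. It assumes that "256K" means 256 × 1024 tokens and that a single input (the video or the audio) fills the entire context window by itself. Neither assumption is stated in the report, and the function names are hypothetical. Under those assumptions it gives upper bounds on how many tokens each video frame or each second of audio could occupy.

```python
# Back-of-envelope budget check using only figures quoted in the abstract.
# Assumptions (NOT from the report): "256K" means 256 * 1024 tokens, and each
# input is considered alone, filling the whole context window.

CONTEXT_TOKENS = 256 * 1024  # assumed interpretation of "256K"
VIDEO_SECONDS = 400          # from the abstract
VIDEO_FPS = 1                # from the abstract
AUDIO_HOURS = 10             # from the abstract ("over 10 hours")


def video_frames(seconds: int, fps: int) -> int:
    """Number of frames sampled from a clip of the given length."""
    return seconds * fps


def max_tokens_per_frame(context_tokens: int, frames: int) -> float:
    """Upper bound on tokens per frame if the clip alone fills the context."""
    return context_tokens / frames


def max_audio_tokens_per_second(context_tokens: int, hours: float) -> float:
    """Upper bound on tokens per second if the audio alone fills the context."""
    return context_tokens / (hours * 3600)


if __name__ == "__main__":
    frames = video_frames(VIDEO_SECONDS, VIDEO_FPS)
    per_frame = max_tokens_per_frame(CONTEXT_TOKENS, frames)
    per_second = max_audio_tokens_per_second(CONTEXT_TOKENS, AUDIO_HOURS)
    print(f"frames: {frames}")
    print(f"max tokens per frame: {per_frame:.1f}")
    print(f"max audio tokens per second: {per_second:.2f}")
```

These are ceilings implied by the stated assumptions, not measured token rates; the actual per-frame and per-second costs depend on the encoders, which the abstract does not specify.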
