
## Qwen3.5-Omni Technical Report

April 17, 2026
Authors: Qwen Team
cs.AI

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
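The abstract attributes instability in streaming synthesis to the mismatch in encoding efficiency between text and speech tokenizers, which ARIA addresses by dynamically aligning the two token streams. The report excerpt does not specify the mechanism, so the following is only a minimal, hypothetical sketch of one way such alignment could work: after each incoming text token, emit just enough speech tokens to close the gap to an expected alignment target, so neither stream runs far ahead of the other. All names (`SpeechDecoder`, `stream_synthesize`, the `speech_per_text` ratio) are illustrative assumptions, not the actual ARIA design.

```python
"""Hypothetical sketch of dynamic text/speech token alignment for streaming
speech synthesis. Not the actual ARIA mechanism; names and logic are
illustrative assumptions only."""

from typing import Iterable, Iterator, Protocol


class SpeechDecoder(Protocol):
    """Assumed interface: produces one speech token per decoding step."""
    def step(self, text_token: str) -> int: ...


def stream_synthesize(
    text_stream: Iterable[str],
    decoder: SpeechDecoder,
    speech_per_text: float = 4.0,  # assumed average speech/text token ratio
) -> Iterator[int]:
    """Interleave text consumption with speech-token emission.

    After each incoming text token, emit only enough speech tokens to reach
    the expected alignment target, keeping the text and speech streams in
    step without stalling the pipeline.
    """
    text_seen = 0
    speech_emitted = 0
    for text_token in text_stream:
        text_seen += 1
        target = round(text_seen * speech_per_text)
        while speech_emitted < target:
            yield decoder.step(text_token)
            speech_emitted += 1
```

Under these assumptions, latency stays low because speech tokens are emitted incrementally as text arrives rather than after a full sentence is buffered; a real system would additionally adapt the ratio per utterance and language, which is omitted here.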