# Qwen3.5-Omni Technical Report

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
