Qwen3.5-Omni Technical Report

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omnimodal capabilities. Qwen3.5-Omni-plus achieves state-of-the-art (SOTA) results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro on key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both the Thinker and the Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness of streaming speech synthesis, often caused by differences in encoding efficiency between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly improving the stability and prosody of conversational speech with minimal impact on latency. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: writing code directly from audio-visual instructions, which we call Audio-Visual Vibe Coding.
