Qwen3-TTS Technical Report

January 22, 2026
Authors: Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT; 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., the TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
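To make the tokenizer figures in the abstract concrete, the following is a minimal arithmetic sketch, not code from the Qwen3-TTS release. It assumes that each codebook layer emits exactly one token per frame; the helper name `frame_stats` and the `TOKENIZERS` table are hypothetical and exist only for this illustration. The frame rates and codebook counts are the ones stated in the abstract.

```python
def frame_stats(frame_rate_hz: float, num_codebooks: int) -> tuple[float, float]:
    """Return (frame duration in ms, tokens per second of audio).

    Assumption: every codebook layer contributes one token per frame.
    """
    frame_ms = 1000.0 / frame_rate_hz
    tokens_per_second = frame_rate_hz * num_codebooks
    return frame_ms, tokens_per_second


# Frame rate (Hz) and number of codebooks, as given in the abstract.
TOKENIZERS = {
    "Qwen-TTS-Tokenizer-25Hz": (25.0, 1),   # single-codebook codec
    "Qwen-TTS-Tokenizer-12Hz": (12.5, 16),  # 16-layer multi-codebook design
}

for name, (rate, books) in TOKENIZERS.items():
    frame_ms, tps = frame_stats(rate, books)
    print(f"{name}: {frame_ms:.0f} ms per frame, {tps:.0f} tokens/s")
```

Under this assumption, one frame of the 12.5 Hz tokenizer covers 80 ms of audio, so the reported 97 ms first-packet latency is on the order of a single frame. The abstract does not state codebook sizes, so this sketch does not attempt to derive a bitrate.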