Qwen3-ASR Technical Report

January 29, 2026
Authors: Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both models leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. In addition to open-source benchmarks, we conduct a comprehensive internal evaluation, since ASR models may differ little in open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off: it achieves an average time to first token (TTFT) as low as 92 ms and transcribes 2,000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based non-autoregressive (NAR) timestamp predictor that aligns text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models, with clear advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.
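Since the checkpoints are released under Apache 2.0, a natural first experiment is running inference through the Hugging Face transformers library. The sketch below is assumption-laden: the repository id Qwen/Qwen3-ASR-0.6B and compatibility with the generic automatic-speech-recognition pipeline are not stated in the report, so treat this as illustrative and consult the official model card for the supported loading path.

```python
# Minimal transcription sketch, assuming the released checkpoint is
# published on the Hugging Face Hub and works with the generic ASR
# pipeline. The repo id below is hypothetical; check the model card.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Qwen/Qwen3-ASR-0.6B",  # hypothetical repo id
)

# Transcribe a local audio file (decoding/resampling handled via ffmpeg).
result = asr("sample.wav")
print(result["text"])
```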