Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
March 17, 2026
Authors: Quy-Anh Dang, Chris Ngo
cs.AI
Abstract
We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language, and deliberately omitting language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with the 14.32 of MERaLiON-2-10B-ASR, a model 6x larger, while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
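The balanced sampling described above can be illustrated with a minimal sketch. This is not the authors' released code: the corpus contents and the `balanced_sample` helper are hypothetical, and it simply draws an equal number of utterances per language (capped at the smallest corpus) with no language tag attached to any example.

```python
import random


def balanced_sample(corpora, per_language, seed=0):
    """Draw an equal number of utterances from each language's corpus.

    No language tag is attached to the sampled utterances, mirroring the
    paper's choice to let the model infer the language from audio alone.
    """
    rng = random.Random(seed)
    mixed = []
    for utterances in corpora.values():
        mixed.extend(rng.sample(utterances, per_language))
    rng.shuffle(mixed)  # interleave languages within the training mix
    return mixed


# Hypothetical corpora of unequal size, one list of utterance IDs per language.
corpora = {
    "en": [f"en_{i}" for i in range(100)],
    "zh": [f"zh_{i}" for i in range(80)],
    "ta": [f"ta_{i}" for i in range(60)],
    "ms": [f"ms_{i}" for i in range(70)],
}

# Equalize at the size of the smallest corpus (here Tamil, 60 utterances).
n = min(len(v) for v in corpora.values())
training_mix = balanced_sample(corpora, n)
```

Capping at the smallest corpus is one simple way to equalize utterance counts; oversampling the smaller languages up to the largest corpus would be an equally valid reading of "balanced".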