Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored to Singapore's linguistic landscape, covering English, Mandarin, Tamil, and Malay. The models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora. Training uses a balanced sampling strategy that equalizes the number of training utterances per language, and it deliberately omits language-tag conditioning so that the model learns to identify the language implicitly from the audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model roughly 6x larger, while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference is approximately 20x faster than MERaLiON: 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
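
The balanced sampling idea in the abstract (equal numbers of training utterances per language) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `balance_by_language`, the `(language, utterance_id)` input format, and the choice to downsample to the smallest language by default (or sample with replacement when a larger target is requested) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_language(utterances, per_language=None, seed=0):
    """Return a shuffled list with the same number of utterances per language.

    utterances: iterable of (language, utterance_id) pairs (hypothetical format).
    per_language: target count per language. Defaults to the size of the
        smallest language, i.e. pure downsampling. A larger target is met
        by sampling with replacement from the smaller languages.
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for lang, utt in utterances:
        by_lang[lang].append(utt)
    if not by_lang:
        return []

    target = per_language or min(len(pool) for pool in by_lang.values())
    balanced = []
    for lang in sorted(by_lang):
        pool = by_lang[lang]
        if target <= len(pool):
            picked = rng.sample(pool, target)  # downsample without replacement
        else:
            # Keep every utterance once, then top up with replacement.
            picked = pool + rng.choices(pool, k=target - len(pool))
        # The language label is kept for bookkeeping only; with no
        # language-tag conditioning it would not be passed to the model.
        balanced.extend((lang, utt) for utt in picked)

    rng.shuffle(balanced)
    return balanced
```

For example, calling `balance_by_language(data)` on a corpus with 50 English, 30 Mandarin, 10 Tamil, and 20 Malay utterances would return 10 utterances per language under the default downsampling assumption.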
