Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored to the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. The models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora. Training uses a balanced sampling strategy that equalizes the number of training utterances per language, and it deliberately omits language-tag conditioning so that the model learns to identify the language implicitly from the audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger. Its training cost is $81 on a single RTX PRO 6000 GPU, compared to $18,862 for the 128-GPU baseline. Inference is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
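The balanced sampling idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that the number of training utterances per language is equalized; it does not say how. The function name `balance_by_language`, the toy data, and the choice to downsample to the smallest language (with optional oversampling with replacement when a larger target is requested) are therefore assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def balance_by_language(utterances, per_language=None, seed=0):
    """Return a subset with the same number of utterances per language.

    utterances: list of (language, utterance_id) pairs. The language label
        is used only to drive sampling; it is not meant to be fed to the
        model, in line with omitting language-tag conditioning.
    per_language: target count per language. Defaults to the size of the
        smallest language (pure downsampling, an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_lang = {}
    for lang, utt in utterances:
        by_lang.setdefault(lang, []).append(utt)

    if per_language is None:
        target = min(len(pool) for pool in by_lang.values())
    else:
        target = per_language

    balanced = []
    for lang in sorted(by_lang):
        pool = by_lang[lang]
        if len(pool) >= target:
            # Enough data: sample without replacement.
            picked = rng.sample(pool, target)
        else:
            # Too little data: oversample with replacement.
            picked = [rng.choice(pool) for _ in range(target)]
        balanced.extend((lang, utt) for utt in picked)

    rng.shuffle(balanced)  # mix languages within the training order
    return balanced


# Hypothetical, deliberately imbalanced toy corpus.
toy = (
    [("en", f"en_{i}") for i in range(50)]
    + [("zh", f"zh_{i}") for i in range(30)]
    + [("ta", f"ta_{i}") for i in range(8)]
    + [("ms", f"ms_{i}") for i in range(12)]
)

balanced = balance_by_language(toy)
counts = Counter(lang for lang, _ in balanced)
print(dict(counts))
```

With the default target, every language contributes as many utterances as the smallest one in the toy corpus. Passing a larger `per_language` value instead repeats utterances from the low-resource languages, which trades data diversity for a larger balanced set.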
