Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Abstract

We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
