Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Abstract

We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
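The size and error-rate comparisons above can be checked with a little arithmetic. The sketch below is illustrative only and rests on several assumptions. It reads "48% lower" as a relative reduction against the baseline error rate. It uses the published Whisper parameter counts (39M, 244M, 769M), which the abstract does not state. The error rates in the last line are hypothetical values chosen to produce a 48% reduction, not results from the paper. The function names are made up for this example.

```python
# Parameter counts in millions. 27 comes from the abstract; the Whisper
# figures are the publicly documented model sizes (an assumption here,
# since the abstract itself only gives the ratios).
MOONSHINE_PARAMS_M = 27
WHISPER_PARAMS_M = {"tiny": 39, "small": 244, "medium": 769}


def size_ratio(other_params_m: float, base_params_m: float = MOONSHINE_PARAMS_M) -> float:
    """Return how many times larger the other model is than the base model."""
    return other_params_m / base_params_m


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Return the error-rate reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    for name, params in WHISPER_PARAMS_M.items():
        print(f"Whisper {name}: {size_ratio(params):.1f}x the 27M model")

    # Hypothetical error rates, used only to illustrate the calculation.
    print(f"{relative_reduction(25.0, 13.0):.0%} lower than the baseline")
```

Under these assumptions the ratios come out at roughly 9x for Whisper Small and roughly 28x for Whisper Medium, which is consistent with the figures quoted in the abstract.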

