Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
September 2, 2025
Authors: Evan King, Adam Sabra, Manjunath Kudlur, James Wang, Pete Warden
cs.AI
Abstract
We present the Flavors of Moonshine, a suite of tiny automatic speech
recognition (ASR) models specialized for a range of underrepresented languages.
Prevailing wisdom suggests that multilingual ASR models outperform monolingual
counterparts by exploiting cross-lingual phonetic similarities. We challenge
this assumption, showing that for sufficiently small models (27M parameters),
training monolingual systems on a carefully balanced mix of high-quality
human-labeled, pseudo-labeled, and synthetic data yields substantially superior
performance. On average, our models achieve error rates 48% lower than the
comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small
model, and in most cases match or outperform the 28x larger Whisper Medium
model. These results advance the state of the art for models of this size,
enabling accurate on-device ASR for languages that previously had limited
support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and
Vietnamese Moonshine models under a permissive open-source license.
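
The comparisons in the abstract are relative figures, so a small sketch may help make the arithmetic concrete. The snippet below shows one way to compute a relative error-rate reduction and the model size ratios. The error rates are hypothetical values chosen only for illustration, and the Whisper parameter counts are assumed from OpenAI's published Whisper model card; neither appears in the abstract itself. Only the 27M figure comes from the text above.

```python
def relative_error_reduction(baseline_error: float, model_error: float) -> float:
    """Return the fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical error rates for illustration only (not results from the paper).
baseline_error = 0.50  # e.g. a baseline model's error rate on some test set
model_error = 0.26     # e.g. a specialized model's error rate on the same set
reduction = relative_error_reduction(baseline_error, model_error)
print(f"Relative error reduction: {reduction:.0%}")

# Parameter counts: 27M is stated in the abstract; the Whisper figures are
# assumptions taken from the public Whisper model card.
MOONSHINE_PARAMS = 27e6
WHISPER_PARAMS = {"tiny": 39e6, "small": 244e6, "medium": 769e6}

size_ratios = {name: count / MOONSHINE_PARAMS for name, count in WHISPER_PARAMS.items()}
for name, ratio in size_ratios.items():
    print(f"Whisper {name}: {ratio:.1f}x the parameters of a 27M model")
```

Under these assumed counts, the ratios for Whisper Small and Whisper Medium round to the 9x and 28x quoted in the abstract, and a 48% relative reduction means the specialized model makes roughly half as many errors as the baseline.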