Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages
March 30, 2025
Authors: Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Inma Hernáez Rioja
cs.AI
Abstract
Automatic speech recognition systems have undoubtedly advanced with the
integration of multilingual and multitask models such as Whisper, which have
shown a promising ability to understand and process speech across a wide range
of languages. Despite their robustness, these models often fall short in
handling the linguistic distinctions of minority languages. This study
addresses this gap by integrating traditional and novel language models with
fine-tuned Whisper models to improve their performance in less commonly studied
languages. Through rigorous fine-tuning and evaluation across multiple
datasets, we demonstrate substantial improvements in word error rate,
particularly in low-resource scenarios. Our approach not only takes
advantage of the extensive data Whisper was pre-trained on, but also
complements its linguistic adaptability by incorporating language models. We
obtained improvements of up to 51% for in-distribution datasets and up to 34%
for out-of-distribution sentences using statistical language models, while
large language models provided moderate but consistently robust improvements
across diverse linguistic contexts. The findings reveal that, while the
integration reliably benefits all model sizes, the extent of improvement
varies, highlighting the importance of optimized language model parameters.
Finally, we emphasize the importance of selecting appropriate evaluation
parameters when reporting results with transformer-based ASR models. In
summary, this research clears the way for more inclusive ASR technologies that
perform better across languages by enriching their linguistic knowledge. For
further implementation details of this study, the technical documentation and
source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
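
The statistical language model integration described in the abstract can be approximated with n-best rescoring (shallow fusion at the hypothesis level). The sketch below is illustrative only, not the authors' exact implementation: the LM file name, the candidate list, and the fusion weights `alpha` and `beta` are hypothetical placeholders. In practice the n-best hypotheses would come from Whisper's beam-search decoder, and the weights would be tuned on a development set, consistent with the abstract's note on optimizing language model parameters.

```python
# Minimal sketch: rescore ASR n-best hypotheses with a KenLM n-gram model.
# Hypothetical setup; the paper's actual fusion method may differ.
import kenlm  # pip install kenlm

lm = kenlm.Model("basque_5gram.arpa")  # hypothetical LM file path


def rescore(candidates, alpha=0.5, beta=1.0):
    """Return the hypothesis maximizing a shallow-fusion score:
    ASR log-prob + alpha * LM log-prob + beta * word count.
    `candidates` is a list of (text, asr_logprob) pairs.
    Note: KenLM scores are log10; the ASR score base should match
    or alpha absorbs the difference during tuning.
    """
    def combined(hyp):
        text, asr_logprob = hyp
        lm_logprob = lm.score(text, bos=True, eos=True)
        return asr_logprob + alpha * lm_logprob + beta * len(text.split())

    return max(candidates, key=combined)[0]


# Hypothetical n-best list from a fine-tuned Whisper beam search:
nbest = [("kaixo mundua", -3.2), ("kaixo munduan", -3.0)]
print(rescore(nbest))
```

The length bonus `beta` counteracts the LM's bias toward shorter hypotheses; sweeping `alpha` and `beta` on held-out data is one plausible way to realize the parameter optimization the abstract highlights.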