Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Abstract

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results for transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
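As a rough illustration of the idea summarized above, the following minimal sketch rescores candidate transcriptions by adding a weighted language model score to the ASR score, then measures word error rate. It is not the implementation from the linked repository. The scoring formula, the `alpha` and `beta` parameters, the toy language model, and the example scores are all assumptions made for illustration. Reading the reported percentages as relative WER reductions is also an assumption, since the abstract does not say how they are computed.

```python
from typing import Callable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rescore(candidates: List[Tuple[str, float]],
            lm_logprob: Callable[[str], float],
            alpha: float = 0.5,
            beta: float = 0.0) -> str:
    """Return the candidate text with the highest fused score.

    Assumed fusion rule (illustrative, not taken from the source):
    asr_logprob + alpha * lm_logprob(text) + beta * word_count
    """
    def fused(item: Tuple[str, float]) -> float:
        text, asr_logprob = item
        return asr_logprob + alpha * lm_logprob(text) + beta * len(text.split())
    return max(candidates, key=fused)[0]


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, e.g. 0.20 -> 0.10 is a 50% improvement."""
    return (baseline_wer - new_wer) / baseline_wer


def toy_lm_logprob(text: str) -> float:
    # Hypothetical stand-in for a real n-gram or neural language model:
    # in-vocabulary words are cheap, unknown words are heavily penalized.
    vocab = {"the", "cat", "sat", "on", "mat"}
    return sum(-1.0 if w in vocab else -5.0 for w in text.split())


# Made-up ASR candidates with made-up log-probabilities.
reference = "the cat sat on the mat"
candidates = [("the cat sat on the mat", -4.2),
              ("the cat sad on the mat", -4.0)]

without_lm = rescore(candidates, toy_lm_logprob, alpha=0.0)
with_lm = rescore(candidates, toy_lm_logprob, alpha=0.5)
print(word_error_rate(reference, without_lm), word_error_rate(reference, with_lm))
```

In this toy setup the ASR score alone prefers the candidate containing the misrecognized word, while the fused score prefers the correct one. The weight `alpha` controls how strongly the language model can override the acoustic model, which is one reason tuned language model parameters matter.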
