
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

May 22, 2025
Authors: Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
cs.AI

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
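The intelligibility-gating idea described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the helper names are hypothetical, word error rate (WER) stands in for whatever intelligibility metric the authors use, and the 0.3 threshold is invented for demonstration. The idea is that each synthetic utterance is transcribed by an existing ASR model, and the (audio, text) pair is kept for training only if the round-trip error stays below a threshold.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: token-level edit distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[-1][-1] / max(len(ref), 1)


def keep_synthetic_pair(source_text: str, asr_transcript: str,
                        threshold: float = 0.3) -> bool:
    """Keep a (synthetic audio, text) pair only if the ASR round-trip
    transcript is close enough to the source text (hypothetical threshold)."""
    return wer(source_text, asr_transcript) <= threshold
```

In a full pipeline, `asr_transcript` would come from running an off-the-shelf ASR model on the TTS output; pairs that fail the gate are discarded before continued pre-training.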

