From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
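The intelligibility-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the metric or the threshold, so everything here is an assumption for illustration: intelligibility is approximated by the word error rate (WER) between the source text and an ASR transcript of the synthesized audio, the function names `word_error_rate` and `filter_by_intelligibility` are hypothetical, and the cutoff `max_wer=0.2` is an arbitrary placeholder, not a value from the paper.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention assumed here: empty reference is perfect only if
        # the hypothesis is empty too.
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def filter_by_intelligibility(
    pairs: List[Tuple[str, str]], max_wer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (source_text, asr_transcript) pairs at or below the WER cutoff.

    max_wer is a hypothetical threshold chosen for illustration only.
    """
    return [(text, transcript) for text, transcript in pairs
            if word_error_rate(text, transcript) <= max_wer]


if __name__ == "__main__":
    samples = [
        ("hello world", "hello world"),   # intelligible synthesis
        ("hello world", "yellow word"),   # badly garbled synthesis
    ]
    print(filter_by_intelligibility(samples))
```

Whitespace tokenization is a simplification: for languages written without spaces, a character-level error rate would be the more natural stand-in.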
