

From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

May 22, 2025
作者: Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
cs.AI

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
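The intelligibility-based assessment described in the abstract can be sketched as a back-transcription check: a synthetic clip is kept for ASR training only if an ASR transcript of it stays close to the source text. The sketch below is illustrative, not the paper's exact framework; the `transcribe` step is assumed to come from an external ASR model, and the 0.2 word-error-rate threshold is a placeholder (the paper determines its own thresholds empirically).

```python
# Minimal sketch of an intelligibility filter for synthetic TTS speech.
# Assumption: an upstream ASR model has already produced `asr_transcript`
# for each synthetic clip; the 0.2 WER threshold is illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def keep_synthetic_clip(source_text: str, asr_transcript: str,
                        max_wer: float = 0.2) -> bool:
    """Accept a TTS clip for ASR training only if its back-transcription
    WER is under the chosen intelligibility threshold."""
    return word_error_rate(source_text.lower(),
                           asr_transcript.lower()) <= max_wer
```

In a full pipeline, clips that pass this filter would be pooled across languages and used for continued pre-training of the ASR model.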
