From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech with off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech are enough to train TTS models that generate high-quality synthetic speech at hundreds of times the original volume. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, reducing average transcription error by over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
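The abstract does not specify how intelligibility is measured, so the following is only an illustrative sketch under a stated assumption: that intelligibility is approximated by the word error rate (WER) between the source text and an ASR transcript of the synthesized audio, and that utterances above a cutoff are discarded. The function names and the `max_wer=0.2` cutoff are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def filter_intelligible(
    pairs: List[Tuple[str, str]], max_wer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (source_text, asr_transcript) pairs whose WER is at most max_wer.

    max_wer is a hypothetical cutoff chosen for illustration only.
    """
    return [(text, hyp) for text, hyp in pairs if word_error_rate(text, hyp) <= max_wer]
```

In such a setup, each pair would hold the text given to the TTS model and the transcript an ASR model produces from the resulting audio; only pairs that survive the filter would be added to the ASR training set.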
