A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

Abstract

We propose a self-refining framework that enhances ASR performance using only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels for unannotated speech; these pseudo-labels are then used to train a high-fidelity text-to-speech (TTS) system. The synthesized speech-text pairs are then bootstrapped into the original ASR system, completing a closed-loop self-improvement cycle. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content generated by AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Compared to Whisper, Twister reduces error rates by up to 20% on Mandarin benchmarks and by up to 50% on Mandarin-English code-switching benchmarks. These results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and offer a practical pathway for improving ASR performance in low-resource or domain-specific settings.
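The closed loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_refine`, `transcribe`, `train_tts`, `finetune_asr`, and `relative_error_reduction` are hypothetical, the models are passed in as plain callables, and the sketch assumes the ASR model is fine-tuned on the synthesized pairs alone and that the reported reductions are relative to the baseline error rate.

```python
from typing import Callable, List, Tuple

Audio = List[float]          # placeholder waveform type (assumption)
Pair = Tuple[Audio, str]     # (speech, transcript)


def self_refine(
    unlabeled_speech: List[Audio],
    texts: List[str],
    transcribe: Callable[[Audio], str],
    train_tts: Callable[[List[Pair]], Callable[[str], Audio]],
    finetune_asr: Callable[[List[Pair]], Callable[[Audio], str]],
) -> Callable[[Audio], str]:
    """One pass of the loop; every callable is a hypothetical stand-in."""
    # Step 1: the existing ASR model pseudo-labels the unannotated speech.
    pseudo_labeled = [(audio, transcribe(audio)) for audio in unlabeled_speech]
    # Step 2: the pseudo-labeled pairs are used to train a TTS system.
    synthesize = train_tts(pseudo_labeled)
    # Step 3: the TTS system turns text data into synthetic speech-text pairs.
    synthetic_pairs = [(synthesize(text), text) for text in texts]
    # Step 4: the synthetic pairs are fed back to adapt the original ASR model.
    return finetune_asr(synthetic_pairs)


def relative_error_reduction(baseline: float, adapted: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline - adapted) / baseline
```

With illustrative numbers that are not taken from the paper, a baseline error rate of 10.0 and an adapted error rate of 8.0 would give `relative_error_reduction(10.0, 8.0) == 0.2`, i.e. a 20% relative reduction.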
