A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

June 10, 2025
作者: Cheng Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee
cs.AI

Abstract

We propose a self-refining framework that enhances ASR performance using only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. The synthesized speech-text pairs are then bootstrapped back into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. The results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provide a practical pathway for improving ASR performance in low-resource or domain-specific settings.
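To make the closed-loop pipeline concrete, the following is a minimal Python sketch of the self-refining cycle described above. The helper functions (transcribe, train_tts, synthesize, finetune_asr) and the data variables are hypothetical placeholders assumed for illustration, not the authors' released code.

# Minimal sketch of the self-refining loop; helper functions are assumed
# placeholders (transcribe, train_tts, synthesize, finetune_asr), not the
# authors' actual implementation.

def self_refine(asr_model, unlabeled_speech, text_corpus):
    # Step 1: the existing ASR model pseudo-labels the unannotated speech.
    pseudo_pairs = [(audio, asr_model.transcribe(audio)) for audio in unlabeled_speech]

    # Step 2: train a high-fidelity TTS system on the pseudo-labeled pairs.
    tts_model = train_tts(pseudo_pairs)

    # Step 3: synthesize speech for the text corpus to obtain speech-text pairs
    # (e.g. Mandarin and Mandarin-English code-switching sentences).
    synthetic_pairs = [(tts_model.synthesize(text), text) for text in text_corpus]

    # Step 4: fine-tune the original ASR model on the synthetic pairs, closing
    # the self-improvement loop (here, adapting Whisper-large-v2 into Twister).
    return finetune_asr(asr_model, synthetic_pairs)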