

A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

June 10, 2025
Authors: Cheng Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee
cs.AI

Abstract

We propose a self-refining framework that enhances ASR performance using only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. The synthesized speech-text pairs are then bootstrapped back into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. These results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provide a practical pathway for improving ASR performance in low-resource or domain-specific settings.
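A minimal sketch of the closed-loop cycle described in the abstract is shown below. It only reflects the four steps the abstract names (pseudo-labeling, TTS training, synthesis, ASR fine-tuning); the callable names, the placeholder Audio type, and the overall signature are hypothetical illustrations, not APIs from the paper or from Whisper.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical placeholder type for a speech waveform; not from the paper.
Audio = bytes

def self_refine(
    transcribe: Callable[[Audio], str],                                   # existing ASR model (e.g. Whisper-large-v2)
    train_tts: Callable[[List[Tuple[Audio, str]]], Callable[[str], Audio]],   # trains a TTS system, returns a synthesizer
    finetune_asr: Callable[[List[Tuple[Audio, str]]], Callable[[Audio], str]],  # fine-tunes the ASR model on new pairs
    unlabeled_speech: Iterable[Audio],
    text_corpus: Iterable[str],
) -> Callable[[Audio], str]:
    """One pass of the closed-loop self-improvement cycle from the abstract."""
    # Step 1: the existing ASR model pseudo-labels the unannotated speech.
    pseudo_pairs = [(audio, transcribe(audio)) for audio in unlabeled_speech]

    # Step 2: train a high-fidelity TTS system on the pseudo-labeled pairs.
    synthesize = train_tts(pseudo_pairs)

    # Step 3: synthesize speech for the text corpus to form new speech-text pairs.
    synthetic_pairs = [(synthesize(text), text) for text in text_corpus]

    # Step 4: bootstrap the synthetic pairs back into the original ASR system.
    return finetune_asr(synthetic_pairs)
```

In this reading, the refined ASR model returned by the final step could in principle seed another iteration of the loop, which is what makes the framework self-refining.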