
DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

July 20, 2025
Authors: Yinghao Aaron Li, Xilin Jiang, Fei Tao, Cheng Niu, Kaifeng Xu, Juntong Song, Nima Mesgarani
cs.AI

Abstract

Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work on DMOSpeech demonstrated direct metric optimization for the speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through reinforcement learning. The proposed system implements a novel duration policy framework using group relative policy optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, the paper introduces teacher-guided sampling, a hybrid approach that uses the teacher model for the initial denoising steps before handing off to the student model, significantly improving output diversity while maintaining efficiency. Comprehensive evaluations demonstrate superior performance across all metrics compared to previous systems, while halving the number of sampling steps without quality degradation. These advances represent a significant step toward speech synthesis systems with metric optimization across multiple components. Audio samples, code, and pre-trained models are available at https://dmospeech2.github.io/.
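To make the duration-policy idea concrete, below is a minimal sketch of a GRPO-style update, not the authors' released code. The names `policy.sample`, `synthesize`, `speaker_sim`, and `wer` are hypothetical stand-ins for the paper's duration predictor, synthesizer, and reward models, and the sketch omits the clipped-ratio and KL-regularization terms of full GRPO for brevity.

```python
import torch

def grpo_step(policy, optimizer, prompt, group_size=8, beta=1.0):
    """One GRPO-style policy-gradient step for a duration predictor (sketch)."""
    # Sample a group of candidate durations (and their log-probs) for one prompt.
    durations, log_probs = policy.sample(prompt, n=group_size)

    # Synthesize once per candidate and score it: reward speaker similarity,
    # penalize word error rate (both hypothetical scorer functions here).
    audio = [synthesize(prompt, d) for d in durations]
    rewards = torch.tensor([speaker_sim(a) - beta * wer(a) for a in audio])

    # Group-relative advantage: standardize rewards within the group, which
    # serves as the baseline in place of a learned critic network.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # REINFORCE-style update weighted by the relative advantage.
    loss = -(adv * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The group-relative normalization is the key design point: because the reward signals (speaker similarity, WER) are non-differentiable metrics computed on synthesized audio, comparing candidates within a group sidesteps both reward-scale calibration and a value network.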
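Teacher-guided sampling can likewise be illustrated with a short sketch under stated assumptions: the `model.denoise` interface (one reverse-diffusion update), the linear time schedule, and the hand-off point of 4 steps are illustrative choices, not the paper's exact configuration.

```python
import torch

def teacher_guided_sample(teacher, student, cond, steps=16, switch=4):
    """Hybrid sampler: teacher for early steps, distilled student thereafter (sketch)."""
    # Start from Gaussian noise with a mel-spectrogram-like shape.
    x = torch.randn(1, 80, cond["frames"])
    ts = torch.linspace(1.0, 0.0, steps + 1)  # simple linear time schedule

    for i in range(steps):
        # The teacher handles the first `switch` steps, where most of the
        # output diversity is determined; the distilled student then finishes
        # the trajectory cheaply.
        model = teacher if i < switch else student
        x = model.denoise(x, ts[i], ts[i + 1], cond)
    return x
```

The rationale, per the abstract, is that the early denoising steps dominate sample diversity while the later steps mostly refine detail, so spending the teacher's capacity only up front recovers diversity without giving up the student's step-count savings.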