
DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

July 20, 2025
作者: Yinghao Aaron Li, Xilin Jiang, Fei Tao, Cheng Niu, Kaifeng Xu, Juntong Song, Nima Mesgarani
cs.AI

Abstract

Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work with DMOSpeech demonstrated direct metric optimization for the speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through a reinforcement learning approach. The proposed system implements a novel duration policy framework using group relative preference optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid approach that uses a teacher model for the initial denoising steps before transitioning to the student model, significantly improving output diversity while maintaining efficiency. Comprehensive evaluations demonstrate superior performance across all metrics compared to previous systems, while reducing sampling steps by half without quality degradation. These advances represent a significant step toward speech synthesis systems with metric optimization across multiple components. Audio samples, code, and pre-trained models are available at https://dmospeech2.github.io/.
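As described in the abstract, GRPO trains the duration policy without a learned value critic: for each prompt, a group of candidate duration sequences is sampled, each candidate is scored by speaker similarity and word error rate, and advantages are computed relative to the group's own reward statistics. The sketch below illustrates only that reward-and-advantage step; `synthesize`, `speaker_sim`, and `asr_wer` are hypothetical stand-ins for the student TTS rollout, a speaker-verification similarity score, and an ASR-based WER, and the reward weights are illustrative rather than taken from the paper.

```python
import numpy as np

def grpo_advantages(durations_group, ref_speaker_emb, target_text,
                    synthesize, speaker_sim, asr_wer,
                    w_sim=1.0, w_wer=1.0):
    """Score a group of sampled duration sequences for one prompt and
    return group-relative advantages (GRPO normalizes rewards within
    the group instead of using a value critic)."""
    rewards = []
    for dur in durations_group:
        wav = synthesize(target_text, dur)        # student TTS rollout
        sim = speaker_sim(wav, ref_speaker_emb)   # higher is better
        wer = asr_wer(wav, target_text)           # lower is better
        rewards.append(w_sim * sim - w_wer * wer)
    r = np.asarray(rewards, dtype=np.float64)
    # Standardize within the group: candidates better than the group
    # mean get positive advantage, worse ones get negative.
    return (r - r.mean()) / (r.std() + 1e-8)
```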
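Teacher-guided sampling can be read as a simple split of the denoising schedule: the teacher handles the early, high-noise steps that set global structure (which is where diversity is decided), then hands the partially denoised latent to the distilled student, which finishes the trajectory in few steps. A minimal sketch under that reading, assuming `teacher_step` and `student_step` are hypothetical callables wrapping each model's reverse-diffusion update and `switch_at` marks the hand-off point:

```python
def teacher_guided_sample(teacher_step, student_step, noise, timesteps,
                          switch_at=4):
    """Run the first `switch_at` denoising steps with the teacher,
    then let the fast student complete the remaining steps."""
    x = noise
    for i, t in enumerate(timesteps):
        # Early steps shape prosody and speaker identity (teacher);
        # later steps refine details cheaply (student).
        step = teacher_step if i < switch_at else student_step
        x = step(x, t)
    return x
```

The choice of `switch_at` trades diversity against speed: more teacher steps preserve more of the teacher's output variety, while more student steps keep inference fast.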