DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
July 20, 2025
作者: Yinghao Aaron Li, Xilin Jiang, Fei Tao, Cheng Niu, Kaifeng Xu, Juntong Song, Nima Mesgarani
cs.AI
Abstract
Diffusion-based text-to-speech (TTS) systems have made remarkable progress in
zero-shot speech synthesis, yet optimizing all components for perceptual
metrics remains challenging. Prior work with DMOSpeech demonstrated direct
metric optimization for speech generation components, but duration prediction
remained unoptimized. This paper presents DMOSpeech 2, which extends metric
optimization to the duration predictor through a reinforcement learning
approach. The proposed system implements a novel duration policy framework
using group relative policy optimization (GRPO) with speaker similarity and
word error rate as reward signals. By optimizing this previously unoptimized
component, DMOSpeech 2 creates a more complete metric-optimized synthesis
pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid
approach leveraging a teacher model for initial denoising steps before
transitioning to the student model, significantly improving output diversity
while maintaining efficiency. Comprehensive evaluations demonstrate superior
performance across all metrics compared to previous systems, while reducing
sampling steps by half without quality degradation. These advances represent a
significant step toward speech synthesis systems with metric optimization
across multiple components. Audio samples, code, and pre-trained models are
available at https://dmospeech2.github.io/.
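The core of GRPO is that advantages are computed relative to a group of rollouts for the same prompt rather than via a learned value function. The sketch below illustrates that group-relative standardization for duration rollouts scored by speaker similarity and word error rate; the reward weights and numeric values are illustrative assumptions, not the paper's actual configuration.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within a group
    of candidate duration rollouts sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against a zero-variance group
    return [(r - mean) / std for r in rewards]

def duration_reward(speaker_sim, wer, w_sim=1.0, w_wer=1.0):
    """Composite reward: reward higher speaker similarity, penalize
    word error rate. The weights are hypothetical, not the paper's values."""
    return w_sim * speaker_sim - w_wer * wer

# A group of 4 candidate duration rollouts for one prompt (toy scores).
rewards = [duration_reward(sim, wer) for sim, wer in
           [(0.82, 0.05), (0.75, 0.12), (0.88, 0.03), (0.70, 0.20)]]
advantages = grpo_advantages(rewards)
```

Rollouts with above-group-average reward get positive advantages and are reinforced; below-average ones are suppressed, so the duration policy improves without any explicit duration labels.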
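Teacher-guided sampling, as described above, runs the teacher model for the first few denoising steps and then hands off to the distilled student. A minimal control-flow sketch, with toy stand-in update functions (the real systems are diffusion denoisers; the step counts and update rules here are assumptions for illustration):

```python
def teacher_guided_sample(x, teacher_step, student_step, n_steps, switch_at):
    """Hybrid sampler sketch: use the teacher update for the first
    `switch_at` steps, then the student update for the remaining steps."""
    used_teacher = []
    for t in range(n_steps):
        step = teacher_step if t < switch_at else student_step
        x = step(x, t)
        used_teacher.append(step is teacher_step)
    return x, used_teacher

# Toy "denoisers": each step shrinks x toward 0 at a different rate.
teacher = lambda x, t: x * 0.8  # teacher: many small, diverse early steps
student = lambda x, t: x * 0.5  # student: few aggressive distilled steps

x_final, schedule = teacher_guided_sample(1.0, teacher, student,
                                          n_steps=8, switch_at=3)
```

The design intuition from the abstract: the teacher's early steps restore output diversity that distillation tends to collapse, while the student's later steps keep the total step count low.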