DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Abstract

Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work with DMOSpeech demonstrated direct metric optimization for speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through a reinforcement learning approach. The proposed system implements a novel duration policy framework using group relative preference optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid approach that uses a teacher model for the initial denoising steps before transitioning to the student model, significantly improving output diversity while maintaining efficiency. Comprehensive evaluations demonstrate superior performance across all metrics compared to previous systems, while reducing sampling steps by half without quality degradation. These advances represent a significant step toward speech synthesis systems with metric optimization across multiple components. Audio samples, code, and pre-trained models are available at https://dmospeech2.github.io/.
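The group-relative idea behind GRPO can be illustrated with a minimal sketch. It assumes several candidate durations are sampled for the same utterance, each is scored, and the scores are normalized within that group to obtain advantages. The abstract does not say how speaker similarity and word error rate are combined, so the weighted difference in `combined_reward`, the `wer_weight` parameter, and the example scores are hypothetical choices made only for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def combined_reward(speaker_sim, wer, wer_weight=1.0):
    # Hypothetical scalar reward: higher speaker similarity is better,
    # higher word error rate is worse. The weighting is an assumption.
    return speaker_sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize rewards within one group of samples drawn for the same
    # input: subtract the group mean and divide by the group std.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (speaker similarity, WER) scores for four candidate durations
# sampled for a single utterance.
candidates = [(0.70, 0.02), (0.65, 0.10), (0.72, 0.01), (0.60, 0.25)]
rewards = [combined_reward(s, w) for s, w in candidates]
advantages = group_relative_advantages(rewards)

# Index of the candidate with the largest advantage in this group.
best = max(range(len(advantages)), key=advantages.__getitem__)
```

Candidates with above-average reward receive positive advantages and would be reinforced by a policy-gradient update, while below-average candidates are discouraged. Because the baseline is the group mean, no separate value network is needed in this sketch.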

