Zero-shot Cross-lingual Voice Transfer for TTS

Abstract

In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multilingual text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker encoder that processes reference speech, a bottleneck layer, and residual adapters connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and speaker similarity across languages. Using a single English reference speech sample per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available. This is a valuable utility for those who have never had typical speech or never banked their voice. Cross-lingual typical audio samples and videos demonstrating voice restoration for dysarthric speakers are available at google.github.io/tacotron/publications/zero_shot_voice_transfer.
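To make the data flow between the three named components concrete, the following is a minimal sketch in plain Python, not the architecture used in the paper. The abstract does not specify the internals, so everything here is an assumption made for illustration: the mean-pooling encoder, the tanh bottleneck, the ReLU adapter, all dimensions, and every function and variable name are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix as a list of rows (stand-in for learned weights)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def speaker_encoder(frames):
    """Assumed encoder: mean-pool reference frames into one fixed-size vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def bottleneck(embedding, w_down):
    """Assumed bottleneck: project to a lower dimension and squash with tanh."""
    return [math.tanh(v) for v in matvec(w_down, embedding)]


def residual_adapter(hidden, speaker_code, w_in, w_spk, w_out):
    """Return hidden + w_out @ relu(w_in @ hidden + w_spk @ speaker_code)."""
    pre = [a + b for a, b in zip(matvec(w_in, hidden), matvec(w_spk, speaker_code))]
    act = [max(0.0, p) for p in pre]
    return [h + d for h, d in zip(hidden, matvec(w_out, act))]


rng = random.Random(0)
FRAME_DIM, CODE_DIM, HIDDEN_DIM, ADAPTER_DIM = 8, 3, 6, 4  # illustrative sizes

# Twenty random "frames" stand in for features of one reference recording.
reference_frames = [[rng.gauss(0, 1) for _ in range(FRAME_DIM)] for _ in range(20)]
speaker_code = bottleneck(
    speaker_encoder(reference_frames), rand_matrix(CODE_DIM, FRAME_DIM, rng)
)

# Stand-in for the output of one preexisting TTS layer.
tts_hidden = [rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
w_in = rand_matrix(ADAPTER_DIM, HIDDEN_DIM, rng)
w_spk = rand_matrix(ADAPTER_DIM, CODE_DIM, rng)
w_out = rand_matrix(HIDDEN_DIM, ADAPTER_DIM, rng)

adapted = residual_adapter(tts_hidden, speaker_code, w_in, w_spk, w_out)
```

In this sketch the adapter only adds a speaker-dependent correction to the TTS layer's output. If `w_out` is all zeros, the layer output passes through unchanged, which is one common way to attach adapters to a pretrained network without disturbing it at the start of training. Whether the paper initializes its adapters this way is not stated in the abstract.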