Zero-shot Cross-lingual Voice Transfer for TTS

Abstract

In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multilingual text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker encoder that processes reference speech, a bottleneck layer, and residual adapters connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and speaker similarity across languages. Using a single English reference speech sample per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. Losing one's voice to a physical or neurological condition can lead to a profound sense of loss that affects one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available. This is a valuable capability for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples and videos demonstrating voice restoration for dysarthric speakers are available at google.github.io/tacotron/publications/zero_shot_voice_transfer.
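
The abstract names three components: a speaker encoder, a bottleneck layer, and residual adapters attached to existing TTS layers. The following is a minimal sketch of how the last two could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the layer sizes, the tanh bottleneck, the ReLU adapter, the additive speaker conditioning, and the zero-initialised up-projection. The speaker encoder itself is not modelled; its output is stood in for by a plain vector.

```python
import math
import random


def linear(x, w, b):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_linear(rng, n_out, n_in, scale=0.1):
    """Small random weights and zero bias (hypothetical initialisation)."""
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def zero_linear(n_out, n_in):
    """All-zero weights and bias, so the layer outputs zeros."""
    return [[0.0] * n_in for _ in range(n_out)], [0.0] * n_out


def bottleneck(speaker_embedding, w, b):
    """Squeeze the speaker embedding into a small, tanh-bounded code (assumed form)."""
    return [math.tanh(v) for v in linear(speaker_embedding, w, b)]


def residual_adapter(hidden, speaker_code, params):
    """Return hidden + up(relu(down(hidden) + cond(speaker_code))) (assumed form)."""
    down = linear(hidden, *params["down"])
    cond = linear(speaker_code, *params["cond"])
    act = [max(0.0, d + c) for d, c in zip(down, cond)]
    up = linear(act, *params["up"])
    return [h + u for h, u in zip(hidden, up)]


# Hypothetical sizes, chosen only to keep the example small.
EMB, CODE, HID, ADAPT = 8, 2, 6, 3
rng = random.Random(0)

bn_w, bn_b = init_linear(rng, CODE, EMB)
params = {
    "down": init_linear(rng, ADAPT, HID),
    "cond": init_linear(rng, ADAPT, CODE),
    # Zero up-projection: the adapter starts as an identity mapping, so the
    # pre-existing TTS layer's output is unchanged before any training.
    "up": zero_linear(HID, ADAPT),
}

speaker_embedding = [rng.uniform(-1, 1) for _ in range(EMB)]  # stand-in for encoder output
hidden = [rng.uniform(-1, 1) for _ in range(HID)]             # stand-in for a TTS layer activation

code = bottleneck(speaker_embedding, bn_w, bn_b)
out = residual_adapter(hidden, code, params)
```

In this sketch the bottleneck limits how much information from the reference speech reaches the TTS layers, and the residual form means the adapter can only add a correction on top of what the existing layer already computes. Whether the paper uses these exact forms cannot be determined from the abstract.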

