PolyVoice: Language Models for Speech to Speech Translation
June 5, 2023
Authors: Qianqian Dong, Zhiying Huang, Chen Xu, Yunlong Zhao, Kexin Wang, Xuxin Cheng, Tom Ko, Qiao Tian, Tang Li, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
cs.AI
Abstract
We propose PolyVoice, a language model-based framework for speech-to-speech
translation (S2ST). Our framework consists of two language models: a
translation language model and a speech synthesis language model. We use
discretized speech units, which are generated in a fully unsupervised way, and
thus our framework can be used for unwritten languages. For the speech
synthesis part, we adopt the existing VALL-E X approach and build a unit-based
audio language model. This grants our framework the ability to preserve the
voice characteristics and the speaking style of the original speech. We examine
our system on Chinese → English and English → Spanish
pairs. Experimental results show that our system can generate speech with high
translation quality and audio quality. Speech samples are available at
https://speechtranslation.github.io/polyvoice.