PolyVoice: Language Models for Speech to Speech Translation

Abstract

We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, so our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This gives our framework the ability to preserve the voice characteristics and speaking style of the original speech. We evaluate our system on Chinese → English and English → Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

