PolyVoice: Language Models for Speech to Speech Translation

Abstract

We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). The framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, so the framework can be applied to unwritten languages. For speech synthesis, we adopt the existing VALL-E X approach and build a unit-based audio language model, which lets the framework preserve the voice characteristics and speaking style of the original speech. We evaluate the system on the Chinese → English and English → Spanish pairs. Experimental results show that the system generates speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
