PolyVoice: Language Models for Speech to Speech Translation

Abstract

We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, so the framework can also be applied to unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This gives our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We evaluate our system on Chinese → English and English → Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
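
The two-stage structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the PolyVoice implementation: the nearest-centroid `quantize` function stands in for whatever unsupervised discretization the system uses, `TwoStagePipeline` and its fields are hypothetical names, and the two callables are toy placeholders for the translation and speech synthesis language models. Passing the source units to the synthesizer as a prompt is likewise only one plausible way to condition on the original speaker.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Units = List[int]
Frame = Sequence[float]


def quantize(frames: Sequence[Frame], centroids: Sequence[Frame]) -> Units:
    """Assign each feature frame the index of its nearest centroid (squared L2)."""
    units: Units = []
    for frame in frames:
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


@dataclass
class TwoStagePipeline:
    # Hypothetical stand-ins for the two language models named in the abstract.
    translate: Callable[[Units], Units]                 # source units -> target units
    synthesize: Callable[[Units, Units], List[float]]   # (target units, prompt) -> samples

    def run(self, frames: Sequence[Frame], centroids: Sequence[Frame]) -> List[float]:
        source_units = quantize(frames, centroids)
        target_units = self.translate(source_units)
        # Assumption: the source units double as a prompt so that the
        # synthesizer can condition on the original speaker.
        return self.synthesize(target_units, source_units)


# Toy placeholders so the sketch runs end to end; they are not real models.
pipeline = TwoStagePipeline(
    translate=lambda units: list(reversed(units)),
    synthesize=lambda target, prompt: [float(u) for u in target],
)
centroids = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.0], [0.8, 0.7]]
waveform = pipeline.run(frames, centroids)
```

Because both stages operate on unit sequences rather than text, nothing in this sketch requires a written form of either language, which is the property the abstract highlights for unwritten languages.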
