Continuous Speech Synthesis using per-token Latent Diffusion

Abstract

The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds on the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach uses semantic tokens to provide contextual information and to determine the stopping condition. We propose three continuous variants of our method, which extend popular discrete speech synthesis techniques. Additionally, we implement a discrete baseline for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
