SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Abstract

Recent advances in generative speech models based on audio-text prompts have enabled remarkable innovations such as high-quality zero-shot text-to-speech (TTS). However, existing models are still limited in handling diverse audio-text speech generation tasks that involve transforming input speech or processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. This enables unified and extensible modeling and provides a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show that SpeechX is effective in zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving performance comparable or superior to that of specialized models on each task. See https://aka.ms/speechx for demo samples.
