SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
August 14, 2023
Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
cs.AI
Abstract
Recent advancements in generative speech models based on audio-text prompts
have enabled remarkable innovations like high-quality zero-shot text-to-speech.
However, existing models still face limitations in handling diverse audio-text
speech generation tasks involving transforming input speech and processing
audio captured in adverse acoustic conditions. This paper introduces SpeechX, a
versatile speech generation model capable of zero-shot TTS and various speech
transformation tasks, dealing with both clean and noisy signals. SpeechX
combines neural codec language modeling with multi-task learning using
task-dependent prompting, enabling unified and extensible modeling and
providing a consistent way to leverage textual input in speech enhancement
and transformation tasks. Experimental results show SpeechX's efficacy in
various tasks, including zero-shot TTS, noise suppression, target speaker
extraction, speech removal, and speech editing with or without background
noise, achieving performance comparable or superior to that of specialized
models. See https://aka.ms/speechx for demo samples.
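
To make the idea of task-dependent prompting more concrete, here is a minimal sketch of how a single language model input could be assembled for different tasks. It is an illustration only, not the authors' implementation: the tag strings, the `<text>` and `<audio>` separators, and the `build_prompt` helper are all hypothetical, and the abstract does not specify the actual prompt layout.

```python
from typing import List, Sequence

# Hypothetical task tags. The abstract only states that prompts are
# task-dependent; these strings are assumptions made for illustration.
TASK_TOKENS = {
    "zero_shot_tts": "<tts>",
    "noise_suppression": "<ns>",
    "target_speaker_extraction": "<tse>",
    "speech_removal": "<sr>",
    "speech_editing": "<edit>",
}


def build_prompt(task: str,
                 text_tokens: Sequence[str],
                 acoustic_codes: Sequence[int]) -> List[str]:
    """Concatenate a task tag, text tokens, and codec codes into one sequence.

    text_tokens may be empty for tasks that run without a transcript.
    acoustic_codes stand in for the discrete codes of a neural audio codec.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = [TASK_TOKENS[task], "<text>"]
    prompt.extend(text_tokens)
    prompt.append("<audio>")
    # Render each codec code as a string token so the sequence is homogeneous.
    prompt.extend(f"a{code}" for code in acoustic_codes)
    return prompt


if __name__ == "__main__":
    # Same model input format, two different tasks.
    print(build_prompt("noise_suppression", ["h", "i"], [3, 7]))
    print(build_prompt("zero_shot_tts", ["h", "i"], [12, 5, 9]))
```

Under this assumed layout, only the leading tag and the role of the audio codes change between tasks, which is one way a single model could be trained on several tasks at once.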