SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Abstract

Recent advances in generative speech models based on audio-text prompts have enabled innovations such as high-quality zero-shot text-to-speech (TTS). However, existing models remain limited in handling diverse audio-text speech generation tasks that involve transforming input speech or processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks on both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. This enables unified and extensible modeling and provides a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy across tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, with performance comparable or superior to that of specialized models. See https://aka.ms/speechx for demo samples.
