SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

August 14, 2023
Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
cs.AI

Abstract

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
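
The abstract does not spell out the prompt format, so the following is only a rough sketch of what task-dependent prompting could look like for a single sequence model serving several tasks. The task names are taken from the abstract; the tag strings, the `<text>` and `<audio>` separators, and the `build_prompt` helper are hypothetical and are not claimed to match the paper's implementation.

```python
from typing import List, Optional

# Task names follow the abstract; the tag format itself is an assumption.
TASKS = {
    "zero_shot_tts",
    "noise_suppression",
    "target_speaker_extraction",
    "speech_removal",
    "speech_editing",
}


def build_prompt(
    task: str,
    text_tokens: Optional[List[str]],
    audio_tokens: List[str],
) -> List[str]:
    """Assemble one model input: task tag, optional text, then audio tokens.

    Hypothetical layout: a leading task tag tells a shared model which
    transformation to apply to the codec tokens that follow.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    prompt = [f"<{task}>"]
    if text_tokens:
        # Textual input is optional for enhancement-style tasks.
        prompt += ["<text>"] + list(text_tokens)
    prompt += ["<audio>"] + list(audio_tokens)
    return prompt


# Same model input format, two different tasks.
tts_prompt = build_prompt("zero_shot_tts", ["h", "i"], ["a1", "a2"])
ns_prompt = build_prompt("noise_suppression", None, ["n1", "n2", "n3"])
```

Under this layout, supporting a further task would mean registering one more tag rather than changing the model interface, which is one way to read the "unified and extensible modeling" claim.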