SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Abstract

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks that involve transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
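To make "task-dependent prompting" concrete, the following is a minimal sketch of how a single model could receive different tasks through the layout of its input sequence. It is an illustration only, not the authors' implementation: the function `build_prompt`, the tag strings in `TASK_TAGS`, and the ordering of text tokens, task tag, and audio tokens are all assumptions, since the abstract does not specify them. Only the task names come from the abstract.

```python
from typing import List, Optional

# Task names follow the abstract's task list. The tag strings are
# hypothetical placeholders, NOT the special tokens used by SpeechX.
TASK_TAGS = {
    "zero_shot_tts": "<task:tts>",
    "noise_suppression": "<task:ns>",
    "target_speaker_extraction": "<task:tse>",
    "speech_removal": "<task:sr>",
    "speech_editing": "<task:edit>",
}


def build_prompt(
    task: str,
    text_tokens: Optional[List[str]],
    audio_tokens: List[str],
) -> List[str]:
    """Assemble one input sequence: optional text, a task tag, then audio tokens.

    The ordering is an assumption made for illustration. Text is optional
    because some tasks (e.g. noise suppression) can run without a transcript.
    """
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task!r}")
    prompt: List[str] = list(text_tokens or [])  # empty when no transcript is given
    prompt.append(TASK_TAGS[task])               # tells the model which task to perform
    prompt.extend(audio_tokens)                  # codec tokens of the input audio
    return prompt
```

Under these assumptions, the same function serves every task: `build_prompt("noise_suppression", None, noisy_codes)` omits the text, while `build_prompt("zero_shot_tts", text, enrollment_codes)` supplies it. Supporting a new task then amounts to adding one entry to `TASK_TAGS`, which is the sense in which such a design is extensible.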
