Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
February 12, 2024
Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Steven Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng
cs.AI
Abstract
Laughter is one of the most expressive and natural aspects of human speech,
conveying emotions, social cues, and humor. However, most text-to-speech (TTS)
systems lack the ability to produce realistic and appropriate laughter sounds,
limiting their applications and user experience. While prior work has
generated natural laughter, it falls short in controlling the timing and
variety of the generated laughter. In this work, we propose
ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker
based on a short audio prompt with precise control of laughter timing and
expression. Specifically, ELaTE takes three inputs: an audio prompt whose
voice characteristics it mimics, a text prompt that specifies the content of
the generated speech, and a laughter control input, which can be either the
start and end times of the laughter or an additional audio prompt containing
laughter to be mimicked. We develop our model based on the foundation
of conditional flow-matching-based zero-shot TTS, and fine-tune it with
frame-level representation from a laughter detector as additional conditioning.
With a simple scheme to mix small-scale laughter-conditioned data with
large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS
model can be readily fine-tuned to generate natural laughter with precise
controllability, without losing any quality of the pre-trained zero-shot TTS
model. Our evaluations show that ELaTE can generate laughing speech
with significantly higher quality and controllability compared to conventional
models. See https://aka.ms/elate/ for demo samples.
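
The two mechanisms named in the abstract, frame-level laughter conditioning and mixing a small laughter-conditioned set with large pre-training data, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says the conditioning comes from a laughter detector's frame-level representation, and here a plain 0/1 indicator stands in for it. The function names, the frame rate, and the mixing ratio are all hypothetical choices made for illustration.

```python
import random


def laughter_frame_mask(num_frames, start_s, end_s, frame_rate_hz=100.0):
    """Return a 0/1 list marking the frames that fall inside [start_s, end_s).

    Assumption: a binary indicator replaces the laughter-detector
    representation described in the abstract; frame_rate_hz is hypothetical.
    """
    if end_s < start_s:
        raise ValueError("end_s must not be earlier than start_s")
    # Convert times to frame indices and clamp them to the utterance length.
    first = max(0, int(round(start_s * frame_rate_hz)))
    last = min(num_frames, int(round(end_s * frame_rate_hz)))
    return [1 if first <= i < last else 0 for i in range(num_frames)]


def sample_mixed_batch(pretrain_items, laughter_items, batch_size,
                       laughter_ratio, rng):
    """Fill each batch slot from the laughter set with probability
    laughter_ratio, otherwise from the pre-training set.

    Assumption: laughter_ratio is a hypothetical knob; the abstract only
    says the two data sources are mixed with "a simple scheme".
    """
    batch = []
    for _ in range(batch_size):
        pool = laughter_items if rng.random() < laughter_ratio else pretrain_items
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    # Mark laughter from 0.5 s to 1.0 s in a 2-second utterance at 10 frames/s.
    print(laughter_frame_mask(20, 0.5, 1.0, frame_rate_hz=10.0))
    # Draw a small batch that mixes the two (toy) data sources.
    print(sample_mixed_batch(["p1", "p2", "p3"], ["l1"], 4, 0.25,
                             random.Random(0)))
```

In a real system the per-frame indicator would be replaced by the detector's embedding for each frame and passed to the model alongside the audio and text prompts; the sketch only shows how start and end times map onto a frame grid and how the two data sources could share a batch.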