Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
February 12, 2024
Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Steven Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng
cs.AI
Abstract
Laughter is one of the most expressive and natural aspects of human speech,
conveying emotions, social cues, and humor. However, most text-to-speech (TTS)
systems lack the ability to produce realistic and appropriate laughter sounds,
limiting their applications and user experience. Although prior work has
generated natural laughter, it falls short in controlling the timing and
variety of the generated laughter. In this work, we propose
ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker
based on a short audio prompt with precise control of laughter timing and
expression. Specifically, ELaTE takes three inputs: an audio prompt to mimic
the voice characteristics, a text prompt to specify the content of the
generated speech, and a control input for the laughter expression, which can be
either the start and end times of laughter or an additional audio prompt
containing laughter to be mimicked. We build our model on conditional
flow-matching-based zero-shot TTS and fine-tune it with frame-level
representations from a laughter detector as additional conditioning.
With a simple scheme to mix small-scale laughter-conditioned data with
large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS
model can be readily fine-tuned to generate natural laughter with precise
controllability, without any loss in the quality of the pre-trained zero-shot
TTS model. Through evaluations, we show that ELaTE can generate laughing speech
with significantly higher quality and controllability than conventional
models. See https://aka.ms/elate/ for demo samples.
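
As a rough illustration of the time-based control described in the abstract, the sketch below turns laughter start and end times into a frame-level 0/1 indicator that is aligned with the output frames. This is not the authors' implementation: the abstract says the actual conditioning is a frame-level representation from a laughter detector, whereas the binary mask, the function name `laughter_frame_mask`, and the frame rate used here are all assumptions chosen only to show what "frame-level" conditioning from time stamps could look like.

```python
def laughter_frame_mask(num_frames, intervals, frames_per_second=100.0):
    """Return a 0/1 list marking frames inside any laughter interval.

    Hypothetical helper for illustration only. `intervals` is a list of
    (start_sec, end_sec) pairs; `frames_per_second` is an assumed frame rate.
    """
    mask = [0] * num_frames
    for start_sec, end_sec in intervals:
        if end_sec <= start_sec:
            raise ValueError("each interval needs end_sec > start_sec")
        # Convert seconds to frame indices and clip to the valid range.
        first = max(0, int(round(start_sec * frames_per_second)))
        last = min(num_frames, int(round(end_sec * frames_per_second)))
        for i in range(first, last):
            mask[i] = 1
    return mask


# Example: 1 second of output at an assumed 10 frames per second,
# with laughter requested from 0.2 s to 0.5 s.
example_mask = laughter_frame_mask(10, [(0.2, 0.5)], frames_per_second=10.0)
print(example_mask)
```

In a real system, the marked frames would presumably carry a richer detector-derived vector rather than a single bit; the mask only conveys how time stamps map onto per-frame conditioning.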