Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Abstract

Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems cannot produce realistic and appropriate laughter, which limits their applications and user experience. Prior work has attempted to generate natural laughter but fell short in controlling the timing and variety of the generated laughter. In this work, we propose ELaTE, a zero-shot TTS system that can generate natural laughing speech of any speaker from a short audio prompt, with precise control of laughter timing and expression. Specifically, ELaTE takes three inputs: an audio prompt whose voice characteristics it mimics, a text prompt that specifies the content of the generated speech, and a laughter-control input, which can be either the start and end times of the laughter or an additional audio prompt containing laughter to be mimicked. We build our model on a conditional flow-matching-based zero-shot TTS model and fine-tune it with frame-level representations from a laughter detector as additional conditioning. With a simple scheme that mixes small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained model. Our evaluations show that ELaTE generates laughing speech with significantly higher quality and controllability than conventional models. See https://aka.ms/elate/ for demo samples.
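To make the timing-based control concrete, the sketch below shows one possible way to turn laughter start and end times into a frame-level signal. It is only an illustration under stated assumptions: the abstract says ELaTE conditions on frame-level representations from a laughter detector, but it does not describe how start and end times are mapped onto them. The binary indicator, the function name `laughter_frame_mask`, and the frame rate of 100 frames per second are all hypothetical.

```python
from typing import List, Tuple


def laughter_frame_mask(
    num_frames: int,
    laugh_spans_sec: List[Tuple[float, float]],
    frames_per_second: float = 100.0,  # assumed frame rate, not from the source
) -> List[int]:
    """Return one 0/1 value per frame, with 1 where laughter is requested.

    This is a hypothetical stand-in for a frame-level conditioning signal,
    not the representation used by ELaTE.
    """
    mask = [0] * num_frames
    for start_sec, end_sec in laugh_spans_sec:
        if end_sec <= start_sec:
            raise ValueError("each span needs end > start")
        # Convert seconds to frame indices and clip to the utterance length.
        first = max(0, int(round(start_sec * frames_per_second)))
        last = min(num_frames, int(round(end_sec * frames_per_second)))
        for i in range(first, last):
            mask[i] = 1
    return mask


# Usage: an 8-frame utterance at 2 frames per second, laughing from 1.0 s to 2.5 s.
example_mask = laughter_frame_mask(8, [(1.0, 2.5)], frames_per_second=2.0)
```

In a real system the per-frame values would more plausibly be continuous detector outputs than a hard 0/1 mask; the binary form is used here only to keep the time-to-frame alignment easy to follow.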
