EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
October 1, 2024
Authors: Haozhe Chen, Run Chen, Julia Hirschberg
cs.AI
Abstract
While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotion expressiveness of commercial TTS services.
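For the open-ended text interface, one plausible realization (purely illustrative; the abstract does not detail the two proposed methods) is to retrieve few-shot demonstrative pairs whose emotion captions best match a free-text description, then reuse `emotion_direction` from the sketch above. The corpus entries, captions, and file names below are hypothetical; the sentence-transformers model is a real, commonly used text embedder.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical corpus: expressive utterances annotated with short
# free-text emotion captions, each paired with a neutral recording
# of the same speaker (file paths are placeholders).
corpus = [
    {"caption": "quiet, resigned sadness", "emotional": "sad_01.wav", "neutral": "neu_01.wav"},
    {"caption": "bubbly, excited joy",     "emotional": "joy_01.wav", "neutral": "neu_02.wav"},
    {"caption": "cold, simmering anger",   "emotional": "ang_01.wav", "neutral": "neu_03.wav"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_pairs(description, k=2):
    """Pick the k corpus pairs whose captions best match the open-ended
    emotion description, by cosine similarity of text embeddings."""
    query = model.encode(description, convert_to_tensor=True)
    captions = model.encode([c["caption"] for c in corpus], convert_to_tensor=True)
    scores = util.cos_sim(query, captions)[0]
    top = scores.argsort(descending=True)[:k].tolist()
    return [corpus[i] for i in top]

pairs = retrieve_pairs("a wistful, melancholic tone")
# The retrieved (emotional, neutral) recordings would then be embedded
# by the voice cloning model and fed to emotion_direction().
print([p["caption"] for p in pairs])
```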