Marco-Voice Technical Report
August 4, 2025
Authors: Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
This paper presents a multifunctional speech synthesis system that integrates
voice cloning and emotion-controlled speech synthesis within a unified framework.
The goal of this work is to address longstanding challenges in achieving highly
expressive, controllable, and natural speech generation that faithfully
preserves speaker identity across diverse linguistic and emotional contexts.
Our approach introduces an effective speaker-emotion disentanglement mechanism
with in-batch contrastive learning, enabling independent manipulation of
speaker identity and emotional style, as well as a rotational emotional
embedding integration method for smooth emotion control. To support
comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality
emotional speech dataset containing 10 hours of Mandarin speech from six
professional speakers across seven emotional categories. Extensive experiments
demonstrate that our system, Marco-Voice, achieves substantial improvements in
both objective and subjective metrics. Comprehensive evaluations and analyses
show that Marco-Voice delivers competitive performance
in terms of speech clarity and emotional richness, representing a substantial
advance in the field of expressive neural speech synthesis.