Marco-Voice Technical Report
August 4, 2025
Authors: Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
This paper presents a multifunctional speech synthesis system that integrates
voice cloning and emotion control speech synthesis within a unified framework.
The goal of this work is to address longstanding challenges in achieving highly
expressive, controllable, and natural speech generation that faithfully
preserves speaker identity across diverse linguistic and emotional contexts.
Our approach introduces an effective speaker-emotion disentanglement mechanism
with in-batch contrastive learning, enabling independent manipulation of
speaker identity and emotional style, as well as a rotational emotion
embedding integration method for smooth emotion control. To support
comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality
emotional speech dataset containing 10 hours of Mandarin speech from six
professional speakers across seven emotional categories. Extensive experiments
demonstrate that our system, Marco-Voice, achieves substantial improvements in
both objective and subjective metrics. Comprehensive evaluations and analyses
show that Marco-Voice delivers competitive performance in terms of speech
clarity and emotional richness, marking a significant advance in the field of
expressive neural speech synthesis.
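To make the two mechanisms named in the abstract concrete, the sketch below shows one plausible form of (a) an in-batch contrastive objective that encourages the speaker embedding to be invariant to emotional style, and (b) a rotary-style emotion injection that rotates paired dimensions of a hidden representation by an emotion-dependent angle. This is a minimal illustration under assumed shapes and hyperparameters; the function names, the positive/negative pairing rule, and `temperature` / `emotion_angle` are not taken from the paper.

```python
# Illustrative sketch only; shapes, pairing rule, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def speaker_contrastive_loss(speaker_emb: torch.Tensor,
                             speaker_ids: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: utterances from the same speaker (possibly in
    different emotions) act as positives, utterances from other speakers as
    negatives, pushing emotional style out of the speaker embedding."""
    z = F.normalize(speaker_emb, dim=-1)                 # (B, D)
    sim = z @ z.T / temperature                          # (B, B) scaled cosine similarity
    same_speaker = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = same_speaker & ~eye                       # positives: same speaker, different utterance
    logits = sim.masked_fill(eye, float('-inf'))         # exclude self-similarity
    log_prob = F.log_softmax(logits, dim=-1)
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=-1) / pos_count
    return loss[pos_mask.any(dim=-1)].mean()             # average over samples with a positive


def rotary_emotion_injection(hidden: torch.Tensor,
                             emotion_angle: torch.Tensor) -> torch.Tensor:
    """Rotary-style injection: rotate consecutive dimension pairs of `hidden`
    (B, T, D) by an emotion-dependent angle (broadcastable to (B, 1, D // 2)),
    giving a smooth, norm-preserving way to blend emotional style in."""
    x1, x2 = hidden[..., 0::2], hidden[..., 1::2]        # split into dimension pairs
    cos, sin = emotion_angle.cos(), emotion_angle.sin()
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    return torch.stack((rot1, rot2), dim=-1).flatten(-2) # interleave pairs back to (B, T, D)
```

Because the rotation is norm-preserving, scaling the angle toward zero recovers the neutral representation, which is one way such a scheme could support the smooth emotion control described above.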