Marco-Voice Technical Report

Abstract

This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controllable speech synthesis within a unified framework. The goal of this work is to address the longstanding challenge of generating highly expressive, controllable, and natural speech that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces a speaker-emotion disentanglement mechanism based on in-batch contrastive learning, which enables independent manipulation of speaker identity and emotional style, together with a rotational emotion embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluation and analysis show that Marco-Voice delivers competitive performance in speech clarity and emotional richness, representing a substantial advance in expressive neural speech synthesis.