Marco-Voice Technical Report

Abstract

This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controlled speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotion embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluation and analysis show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.