MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

Abstract

Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system based on a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL achieves human-level accuracy (68.5% on MELD) and superior consistency (Fleiss' kappa of 0.93) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate up to 26 fine-grained speech emotion categories, which human annotators rated as 83% rational. Based on the proposed system, we further release MIKU-EmoBench (131.2 hours), a fine-grained emotional speech dataset, as a new benchmark for emotional text-to-speech and visual voice cloning.
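
The consistency figure above is a Fleiss' kappa score, which measures agreement among a fixed number of raters beyond what chance would produce. As a minimal sketch of how such a score is computed (not the authors' evaluation code; the function name `fleiss_kappa` and the toy counts are illustrative assumptions), each row of the table below is one item and each column is one category:

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for an items x categories table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must carry the same number of ratings (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    n_categories = len(counts[0])

    # Observed agreement: mean over items of the fraction of rater pairs
    # that agree on that item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data (not from the paper): 4 clips, 3 annotation runs,
# 3 emotion categories. Only the third clip shows any disagreement.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
    [0, 0, 3],
]
print(f"kappa = {fleiss_kappa(toy_counts):.3f}")
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, so a reported 0.93 corresponds to labels that are almost always identical across raters or repeated runs.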
