
MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

May 21, 2025
Authors: Yifan Cheng, Ruoyi Zhang, Jiatong Shi
cs.AI

Abstract

Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion-analysis system built on a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL achieves human-level accuracy (68.5% on MELD) and superior consistency (a Fleiss' kappa of 0.93) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate up to 26 fine-grained speech-emotion categories, validated by human annotators with an 83% rationality rating. Based on the proposed system, we further release MIKU-EmoBench, a fine-grained emotional speech dataset (131.2 hours), as a new benchmark for emotional text-to-speech and visual voice cloning.
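
The abstract describes the pipeline only at a high level: faces are detected and tracked in unlabeled video, and the resulting visual evidence, together with the audio, is passed to a multimodal LLM for emotion labeling. Below is a minimal illustrative sketch of such a workflow, not the authors' implementation; the Haar-cascade detector, the frame-sampling rate, and the `query_mllm` placeholder are all assumptions, since the abstract does not name these components.

```python
# Illustrative sketch of a MIKU-PAL-style pipeline (not the authors' code).
# Assumptions: OpenCV's stock Haar cascade stands in for the paper's face
# detection/tracking stage, and `query_mllm` is a hypothetical placeholder
# for the multimodal LLM call; utterance/audio segmentation is taken as given.
import cv2

# Coarse subset for illustration; the paper annotates up to 26 categories.
EMOTIONS = ["neutral", "joy", "anger", "sadness", "fear", "surprise", "disgust"]


def detect_faces(frame):
    """Return (x, y, w, h) face boxes in a BGR frame via a stock Haar cascade."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)


def sample_face_crops(video_path, every_n_frames=30):
    """Yield face crops sampled from a video for downstream emotion analysis."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            for (x, y, w, h) in detect_faces(frame):
                yield frame[y:y + h, x:x + w]
        idx += 1
    cap.release()


def query_mllm(prompt, images, audio):
    """Hypothetical MLLM client; plug in a real multimodal model here."""
    raise NotImplementedError


def label_emotion(video_path, audio_clip):
    """Ask the MLLM to pick one emotion label given face crops plus audio."""
    crops = list(sample_face_crops(video_path))
    prompt = "Classify the speaker's emotion as one of: " + ", ".join(EMOTIONS)
    return query_mllm(prompt, images=crops, audio=audio_clip)
```

The reported consistency metric, Fleiss' kappa, measures agreement among multiple raters beyond what chance would produce; a value of 0.93 falls in the "almost perfect" band of the usual Landis-Koch scale. A minimal example of computing it with statsmodels on toy ratings:

```python
# How a Fleiss' kappa (the paper reports 0.93) is computed; toy data with
# statsmodels. Rows are items, columns are raters, values are category ids.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [0, 0, 0],  # all three raters choose category 0
    [1, 1, 0],  # two raters choose 1, one chooses 0
    [2, 2, 2],  # unanimous again
])
table, _ = aggregate_raters(ratings)         # items x categories count matrix
print(fleiss_kappa(table, method="fleiss"))  # 1.0 = perfect agreement, 0 = chance
```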
