MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

Abstract

Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system based on a multimodal large language model (MLLM). Our results show that MIKU-PAL achieves human-level accuracy (68.5% on MELD) and superior consistency (Fleiss' kappa of 0.93) while being much cheaper and faster than human annotation. With MIKU-PAL's high-quality, flexible, and consistent annotation, we can label up to 26 fine-grained speech emotion categories, a labeling that human annotators validated with 83% rationality ratings. Based on the proposed system, we further released MIKU-EmoBench, a fine-grained emotional speech dataset (131.2 hours), as a new benchmark for emotional text-to-speech and visual voice cloning.
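
The consistency figure reported above is a Fleiss' kappa score, which measures how far agreement among a fixed number of raters exceeds chance agreement. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name `fleiss_kappa` and the toy emotion labels are assumptions made for illustration, and the abstract does not specify how the raters or annotation runs were set up.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list of labels, one per rater.
    categories: list of all possible labels.
    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # How many raters assigned each category to each item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_cats = [sum(c[cat] for c in counts) / total for cat in categories]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: four clips, each labeled by three raters or runs.
toy_ratings = [
    ["joy", "joy", "joy"],
    ["anger", "anger", "joy"],
    ["anger", "anger", "anger"],
    ["joy", "joy", "anger"],
]
kappa = fleiss_kappa(toy_ratings, ["joy", "anger"])
print(f"Fleiss' kappa: {kappa:.3f}")
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so the reported 0.93 indicates labels that are almost fully reproducible across raters.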
