EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Abstract

The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets are often limited by coarse emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection. It comprises EmoNet-Voice Big, a large-scale pre-training dataset featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages, and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories at different intensity levels. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts, who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent from existing datasets. Lastly, we introduce the Empathic Insight Voice models, which set a new standard in speech emotion recognition by achieving high agreement with human experts. Our evaluations across the current model landscape yield valuable findings, for example that high-arousal emotions such as anger are much easier to detect than low-arousal states such as concentration.
