EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
June 11, 2025
Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
cs.AI
Abstract
The advancement of text-to-speech and audio generation models necessitates
robust benchmarks for evaluating the emotional understanding capabilities of AI
systems. Current speech emotion recognition (SER) datasets often exhibit
limitations in emotional granularity, privacy concerns, or reliance on acted
portrayals. This paper introduces EmoNet-Voice, a new resource for speech
emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training
dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions,
and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human
expert annotations. EmoNet-Voice is designed to evaluate SER models on a
fine-grained spectrum of 40 emotion categories with different levels of
intensity. Leveraging state-of-the-art voice generation, we curated synthetic
audio snippets simulating actors portraying scenes designed to evoke specific
emotions. Crucially, we conducted rigorous validation by psychology experts who
assigned perceived intensity labels. This synthetic, privacy-preserving
approach allows for the inclusion of sensitive emotional states often absent in
existing datasets. Lastly, we introduce Empathic Insight Voice models that set
a new standard in speech emotion recognition, achieving high agreement with human
experts. Our evaluations across the current model landscape yield valuable
findings, such as high-arousal emotions like anger being much easier to detect
than low-arousal states like concentration.
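
To make the evaluation described above concrete, the minimal Python sketch below compares a model's per-emotion intensity predictions against expert intensity labels and reports per-emotion rank agreement. The field names, the 0-3 intensity scale, and predict_intensity() are illustrative assumptions for this sketch, not the released EmoNet-Voice API.

    # Minimal sketch of an agreement-style evaluation: model-predicted emotion
    # intensities vs. expert-assigned intensity labels, grouped per emotion.
    # All field names, the 0-3 scale, and predict_intensity() are assumptions.
    import random
    from collections import defaultdict
    from scipy.stats import spearmanr

    # Hypothetical benchmark items: one perceived-intensity label per (clip, emotion).
    benchmark = [
        {"clip_id": "clip_001", "emotion": "anger",         "expert_intensity": 3},
        {"clip_id": "clip_002", "emotion": "anger",         "expert_intensity": 1},
        {"clip_id": "clip_003", "emotion": "anger",         "expert_intensity": 0},
        {"clip_id": "clip_004", "emotion": "concentration", "expert_intensity": 2},
        {"clip_id": "clip_005", "emotion": "concentration", "expert_intensity": 1},
        {"clip_id": "clip_006", "emotion": "concentration", "expert_intensity": 0},
    ]

    def predict_intensity(clip_id: str, emotion: str) -> float:
        """Placeholder for an SER model's predicted intensity of one emotion in one clip."""
        return random.uniform(0.0, 3.0)  # stand-in; replace with a real model call

    # Collect (prediction, expert label) pairs per emotion category.
    pairs = defaultdict(list)
    for item in benchmark:
        pred = predict_intensity(item["clip_id"], item["emotion"])
        pairs[item["emotion"]].append((pred, item["expert_intensity"]))

    # Report rank agreement (Spearman's rho) per emotion, e.g. to contrast
    # high-arousal emotions such as anger with low-arousal states such as concentration.
    for emotion, values in sorted(pairs.items()):
        preds, labels = zip(*values)
        rho, _ = spearmanr(preds, labels)
        print(f"{emotion:>15}: Spearman rho = {rho:.2f}")

Splitting the emotion categories into high- and low-arousal groups before such a comparison is one way to surface gaps like the anger-versus-concentration difference noted in the abstract.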