EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
June 11, 2025
Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
cs.AI
Abstract
The advancement of text-to-speech and audio generation models necessitates
robust benchmarks for evaluating the emotional understanding capabilities of AI
systems. Current speech emotion recognition (SER) datasets often exhibit
limitations in emotional granularity, privacy concerns, or reliance on acted
portrayals. This paper introduces EmoNet-Voice, a new resource for speech
emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training
dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions,
and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human
expert annotations. EmoNet-Voice is designed to evaluate SER models on a
fine-grained spectrum of 40 emotion categories with different levels of
intensity. Leveraging state-of-the-art voice generation, we curated synthetic
audio snippets simulating actors portraying scenes designed to evoke specific
emotions. Crucially, we conducted rigorous validation by psychology experts who
assigned perceived intensity labels. This synthetic, privacy-preserving
approach allows for the inclusion of sensitive emotional states often absent in
existing datasets. Lastly, we introduce Empathic Insight Voice models that set
a new standard in speech emotion recognition, achieving high agreement with human
experts. Our evaluations across the current model landscape yield valuable
findings, such as high-arousal emotions like anger being much easier to detect
than low-arousal states like concentration.
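
To make the evaluation described above concrete, the minimal Python sketch below compares a model's per-emotion intensity predictions against expert intensity labels and reports per-emotion rank agreement. The field names, the 0-3 intensity scale, and predict_intensity() are illustrative assumptions for this sketch, not the released EmoNet-Voice API.

    # Minimal sketch of an agreement-style evaluation: model-predicted emotion
    # intensities vs. expert-assigned intensity labels, grouped per emotion.
    # All field names, the 0-3 scale, and predict_intensity() are assumptions.
    import random
    from collections import defaultdict
    from scipy.stats import spearmanr

    # Hypothetical benchmark items: one perceived-intensity label per (clip, emotion).
    benchmark = [
        {"clip_id": "clip_001", "emotion": "anger",         "expert_intensity": 3},
        {"clip_id": "clip_002", "emotion": "anger",         "expert_intensity": 1},
        {"clip_id": "clip_003", "emotion": "anger",         "expert_intensity": 0},
        {"clip_id": "clip_004", "emotion": "concentration", "expert_intensity": 2},
        {"clip_id": "clip_005", "emotion": "concentration", "expert_intensity": 1},
        {"clip_id": "clip_006", "emotion": "concentration", "expert_intensity": 0},
    ]

    def predict_intensity(clip_id: str, emotion: str) -> float:
        """Placeholder for an SER model's predicted intensity of one emotion in one clip."""
        return random.uniform(0.0, 3.0)  # stand-in; replace with a real model call

    # Collect (prediction, expert label) pairs per emotion category.
    pairs = defaultdict(list)
    for item in benchmark:
        pred = predict_intensity(item["clip_id"], item["emotion"])
        pairs[item["emotion"]].append((pred, item["expert_intensity"]))

    # Report rank agreement (Spearman's rho) per emotion, e.g. to contrast
    # high-arousal emotions such as anger with low-arousal states such as concentration.
    for emotion, values in sorted(pairs.items()):
        preds, labels = zip(*values)
        rho, _ = spearmanr(preds, labels)
        print(f"{emotion:>15}: Spearman rho = {rho:.2f}")

Splitting the emotion categories into high- and low-arousal groups before such a comparison is one way to surface gaps like the anger-versus-concentration difference noted in the abstract.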