
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

June 11, 2025
作者: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
cs.AI

Abstract

The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
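The abstract reports that the Empathic Insight Voice models show "high agreement with human experts" on perceived emotion intensity. One common way to quantify such agreement is a per-emotion rank correlation between model-predicted and expert-assigned intensities. The sketch below is illustrative only, not the paper's actual evaluation protocol: it assumes a hypothetical 4-level intensity scale (0–3) and made-up annotation records, and computes Spearman's rho from scratch for each emotion.

```python
from collections import defaultdict

def rank(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical records: (emotion, expert_intensity, model_intensity),
# intensities on an assumed 0-3 scale (not the paper's data).
records = [
    ("anger", 2, 2), ("anger", 3, 3), ("anger", 1, 1), ("anger", 0, 1),
    ("concentration", 1, 0), ("concentration", 2, 1),
    ("concentration", 0, 2), ("concentration", 3, 1),
]

by_emotion = defaultdict(lambda: ([], []))
for emotion, expert, model in records:
    by_emotion[emotion][0].append(expert)
    by_emotion[emotion][1].append(model)

for emotion, (expert, model) in sorted(by_emotion.items()):
    print(f"{emotion}: rho = {spearman(expert, model):.2f}")
```

On these toy records the high-arousal emotion comes out with a much higher correlation than the low-arousal one, mirroring the paper's finding that states like anger are easier to detect than states like concentration.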