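Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions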

Abstract

Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters, who select emotions from predefined categories. Disagreement among raters is common, however. Conventional methods treat this disagreement as noise and aggregate the labels into a single consensus target. While this reduces SER to a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges these assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them as soft-label distributions. Models trained on individual annotators' ratings and jointly optimized with standard SER systems improve performance on tests with consensus labels. (2) Redefine SER evaluation to include all emotional data and to allow co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. In experiments on four English emotion databases, it outperforms majority and plurality labeling. (3) Construct a penalization matrix that discourages unlikely emotion combinations during training. Integrating it into loss functions further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields SER systems that are more robust and better aligned with human perception.
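
The labeling schemes contrasted in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation. The emotion set `EMOTIONS`, the function names, and in particular the reading of the all-inclusive rule as "keep every emotion that received at least one rating" are assumptions made here for illustration; the abstract only states that the rule aggregates all ratings. Ratings are assumed to be a non-empty list of category names.

```python
from collections import Counter

# Hypothetical label set; real databases define their own categories.
EMOTIONS = ["angry", "sad", "happy", "neutral"]


def soft_label(ratings, classes=EMOTIONS):
    """Share of raters who chose each class (a soft-label distribution)."""
    counts = Counter(ratings)
    total = sum(counts[c] for c in classes)
    return [counts[c] / total for c in classes]


def majority_vote(ratings):
    """Single label chosen by more than half of the raters, else None."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes > len(ratings) / 2 else None


def plurality_vote(ratings):
    """Single label with the most votes; None when the top spot is tied."""
    ranked = Counter(ratings).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def all_inclusive(ratings, classes=EMOTIONS):
    """Multi-hot label keeping every rated emotion (assumed reading of the rule)."""
    present = set(ratings)
    return [1 if c in present else 0 for c in classes]


ratings = ["sad", "angry", "sad", "angry", "neutral"]
print(soft_label(ratings))
print(majority_vote(ratings), plurality_vote(ratings))
print(all_inclusive(ratings))
```

With these five ratings, neither majority nor plurality voting yields a consensus label, so a single-label pipeline would typically drop the sample. The soft-label and multi-hot representations keep it, along with the co-occurrence of sad and angry.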