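Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions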

Abstract

Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters, who select emotions from predefined categories. Disagreements among raters are common, however. Conventional methods treat these disagreements as noise and aggregate the labels into a single consensus target. While this reduces SER to a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges these assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, and that the boundaries between emotions overlap. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them as soft-label distributions. Models trained on individual annotators' ratings and jointly optimized with standard SER systems perform better on consensus-labeled test sets. (2) Redefine SER evaluation to include all emotional data and to allow co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show that it outperforms majority and plurality labeling. (3) Construct a penalization matrix that discourages unlikely emotion combinations during training. Integrating it into the loss function further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
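
The label-aggregation schemes contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The emotion set `EMOTIONS`, the function names, and the exact reading of each rule are assumptions made for illustration. In particular, the all-inclusive rule is interpreted here as marking every emotion chosen by at least one rater, which is one plausible reading of "aggregates all ratings".

```python
from collections import Counter

# Hypothetical label set, chosen only for this illustration.
EMOTIONS = ["angry", "sad", "happy", "neutral"]


def soft_label(ratings):
    """Fraction of raters who chose each emotion (a soft-label distribution)."""
    counts = Counter(ratings)
    return [counts[e] / len(ratings) for e in EMOTIONS]


def majority_label(ratings):
    """Emotion chosen by more than half of the raters, or None if there is none."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes > len(ratings) / 2 else None


def plurality_label(ratings):
    """Single most frequent emotion, or None when the top vote count is tied."""
    ranked = Counter(ratings).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def all_inclusive_label(ratings):
    """Multi-hot target marking every emotion chosen by at least one rater
    (assumed interpretation of the all-inclusive rule)."""
    chosen = set(ratings)
    return [1 if e in chosen else 0 for e in EMOTIONS]


# Five raters disagree on one utterance.
ratings = ["sad", "sad", "angry", "angry", "neutral"]
print(majority_label(ratings))       # None: no emotion exceeds half the votes
print(plurality_label(ratings))      # None: "sad" and "angry" are tied
print(soft_label(ratings))           # [0.4, 0.4, 0.0, 0.2]
print(all_inclusive_label(ratings))  # [1, 1, 0, 1]
```

In this example, majority and plurality voting both fail to produce a label, so a consensus-only pipeline would drop the utterance. The soft-label and all-inclusive targets keep it and preserve the co-occurrence of sad and angry.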