Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions

Abstract

Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters, who select emotions from predefined categories. Disagreement among raters is common, however. Conventional methods treat this disagreement as noise and aggregate the labels into a single consensus target. This reduces SER to a single-label task but ignores the inherent subjectivity of human emotion perception. This dissertation challenges these assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them as soft-label distributions. Models trained on individual annotators' ratings and jointly optimized with standard SER systems improve performance on consensus-labeled test sets. (2) Redefine SER evaluation to include all emotional data and allow co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show that it outperforms majority and plurality labeling. (3) Construct a penalization matrix that discourages unlikely emotion combinations during training. Integrating it into loss functions further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
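As an illustration of the labeling schemes contrasted in the abstract, the sketch below compares a majority-vote label, a soft-label distribution, and a multi-hot label that keeps every rated emotion. The category set, the rater votes, and all function names are hypothetical. In particular, `all_inclusive_label` is only one plausible reading of "aggregating all ratings"; the dissertation's actual rule may be defined differently.

```python
from collections import Counter

# Hypothetical category set; real databases define their own.
EMOTIONS = ["angry", "sad", "happy", "neutral"]


def soft_label(ratings):
    """Relative frequency of each emotion among the raters' votes."""
    counts = Counter(ratings)
    return [counts[e] / len(ratings) for e in EMOTIONS]


def majority_label(ratings):
    """Emotion chosen by more than half of the raters, else None."""
    emotion, votes = Counter(ratings).most_common(1)[0]
    return emotion if votes > len(ratings) / 2 else None


def all_inclusive_label(ratings):
    """Multi-hot target marking every emotion that received a vote.

    Assumption: an illustrative stand-in, not the dissertation's
    exact aggregation rule.
    """
    chosen = set(ratings)
    return [1 if e in chosen else 0 for e in EMOTIONS]


# Made-up votes from five raters for a single utterance.
ratings = ["sad", "angry", "sad", "angry", "neutral"]
print(soft_label(ratings))           # [0.4, 0.4, 0.0, 0.2]
print(majority_label(ratings))       # None
print(all_inclusive_label(ratings))  # [1, 1, 0, 1]
```

With these votes no emotion reaches a majority, so a majority-vote pipeline would have no consensus target for the utterance, while the soft and multi-hot labels both retain the co-occurring "sad" and "angry" ratings.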
